Toxicity Categories

The six toxicity categories Protecto reports — what each measures and how to use them in your moderation and safety workflows.

Protecto reports six toxicity categories for every analyzed text. All fields are always present in the response, even when scores are near zero.

Categories

Field Name        Description
----------------  -----------------------------------------------
toxicity          Overall toxicity score for the text
severe_toxicity   Highly aggressive or extreme toxicity
obscene           Profanity or sexually explicit language
threat            Direct or indirect threats of harm
insult            Derogatory or demeaning language
identity_attack   Attacks targeting a protected group or identity
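Assuming the analysis result arrives as a flat mapping of each field name to a float score (a hypothetical response shape for illustration; consult the API reference for the exact envelope), reading all six categories might look like this sketch:

```python
# Hypothetical response shape: flat mapping of category name -> score.
EXPECTED_FIELDS = [
    "toxicity", "severe_toxicity", "obscene",
    "threat", "insult", "identity_attack",
]

def read_scores(response: dict) -> dict:
    """Return all six category scores, verifying every field is present."""
    missing = [f for f in EXPECTED_FIELDS if f not in response]
    if missing:
        raise ValueError(f"response missing fields: {missing}")
    return {field: float(response[field]) for field in EXPECTED_FIELDS}

# All six fields are always present, even when scores are near zero.
scores = read_scores({
    "toxicity": 0.82, "severe_toxicity": 0.11, "obscene": 0.05,
    "threat": 0.02, "insult": 0.74, "identity_attack": 0.03,
})
```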

How categories relate

The categories are independent — high scores on one category do not imply high scores on others.

For example:

  • Content can score high on insult while scoring near zero on threat
  • Content can be obscene without being an identity_attack
  • toxicity captures overall toxicity and may be elevated even when specific sub-categories are low

Using multiple categories together

Many moderation workflows combine categories:

flag if toxicity > 0.7 OR identity_attack > 0.4 OR threat > 0.5
escalate if severe_toxicity > 0.3
log if any category > 0.2
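The rules above can be sketched in Python. The thresholds are copied from the example and should be tuned for your own risk tolerance; checking rules in descending severity (escalate, then flag, then log) is one reasonable ordering, not a requirement of the API:

```python
def moderation_action(scores: dict) -> str:
    """Map category scores to a moderation action using the example
    thresholds above. Rules are checked in severity order, so content
    matching several rules gets the most severe action."""
    if scores.get("severe_toxicity", 0.0) > 0.3:
        return "escalate"
    if (scores.get("toxicity", 0.0) > 0.7
            or scores.get("identity_attack", 0.0) > 0.4
            or scores.get("threat", 0.0) > 0.5):
        return "flag"
    if any(score > 0.2 for score in scores.values()):
        return "log"
    return "pass"
```

For example, content scoring 0.8 on toxicity alone is flagged, while content scoring 0.25 on a single category is only logged.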

Use identity_attack specifically for detecting hate speech and discriminatory content. It is designed to catch language targeting people based on race, religion, gender, sexual orientation, or other identity characteristics.
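For hate-speech screening specifically, a stricter threshold on identity_attack than on overall toxicity is a common pattern. A minimal sketch, with an illustrative threshold value that you would calibrate against labeled data:

```python
IDENTITY_ATTACK_THRESHOLD = 0.4  # illustrative value; tune against labeled data

def is_likely_hate_speech(scores: dict) -> bool:
    """Flag content whose identity_attack score crosses the threshold,
    independently of the overall toxicity score."""
    return scores.get("identity_attack", 0.0) > IDENTITY_ATTACK_THRESHOLD
```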