Toxicity Detection

Scoring System

How Protecto's toxicity scores are calculated, what the scale means, and how to set thresholds for your application.

Toxicity scores are floating-point values from 0.0 to 1.0, returned per-category for every analyzed text.

Score scale

| Score range | Interpretation |
| --- | --- |
| 0.0 – 0.1 | Very low likelihood — content is generally safe |
| 0.1 – 0.3 | Low likelihood — minor signals, likely benign |
| 0.3 – 0.6 | Moderate — worth logging or reviewing |
| 0.6 – 0.8 | High — strong signal, consider flagging |
| 0.8 – 1.0 | Very high — strong presence of the category |

These ranges are guidelines. Protecto does not enforce any thresholds — you define which score triggers which action.
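The ranges above map naturally to a small lookup table. A minimal sketch (the band labels and boundaries come from the table; adjust them to your own policy):

```python
# Interpretation bands from the score-range table above.
# Each entry is (upper bound, label); boundaries are illustrative.
BANDS = [
    (0.1, "very low"),
    (0.3, "low"),
    (0.6, "moderate"),
    (0.8, "high"),
    (1.0, "very high"),
]

def score_band(score: float) -> str:
    """Return the interpretation band for a 0.0-1.0 toxicity score."""
    for upper, label in BANDS:
        if score <= upper:
            return label
    return "very high"
```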

Example per-category response:

{
  "toxicity": 0.9648129,
  "severe_toxicity": 0.011925396,
  "obscene": 0.39630863,
  "threat": 0.0010725626,
  "insult": 0.9019991,
  "identity_attack": 0.0001435065
}
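A minimal sketch of working with such a response in Python. The category names are taken from the example above; the 0.6 cutoff for flagging is an assumption, not a Protecto default:

```python
# Per-category scores as returned in the example response above.
scores = {
    "toxicity": 0.9648129,
    "severe_toxicity": 0.011925396,
    "obscene": 0.39630863,
    "threat": 0.0010725626,
    "insult": 0.9019991,
    "identity_attack": 0.0001435065,
}

# Highest-scoring category, useful for logging or review queues.
top_category, top_score = max(scores.items(), key=lambda kv: kv[1])

# All categories above an application-chosen cutoff (0.6 here).
flagged = {cat: s for cat, s in scores.items() if s > 0.6}
```

For the example response, `top_category` is `"toxicity"` and both `toxicity` and `insult` exceed the 0.6 cutoff.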

Common patterns for using these scores in application logic:

| Use case | Example threshold |
| --- | --- |
| Audit logging | toxicity > 0.3 |
| User warning | toxicity > 0.6 |
| Block submission | toxicity > 0.85 |
| Escalate to human review | insult > 0.7 OR identity_attack > 0.5 |
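These patterns can be combined into a single decision function. A sketch using the example thresholds from the table above (the action names and ordering are illustrative, not part of the Protecto API):

```python
def choose_action(scores: dict) -> str:
    """Pick a moderation action from per-category scores.

    Checks most severe actions first; thresholds match the
    example table above and should be tuned per application.
    """
    toxicity = scores.get("toxicity", 0.0)
    if toxicity > 0.85:
        return "block"
    if scores.get("insult", 0.0) > 0.7 or scores.get("identity_attack", 0.0) > 0.5:
        return "escalate"
    if toxicity > 0.6:
        return "warn"
    if toxicity > 0.3:
        return "log"
    return "allow"
```

Checking the strictest threshold first means a message can trigger only one action; applications that want independent actions (e.g. always audit-log above 0.3) can evaluate each rule separately instead.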

Scores reflect full context

Toxicity scores reflect the entire sentence, not just individual tokens. This means:

  • Entity masking does not affect the score — toxicity analysis runs on the original content
  • Scores account for context, not just keywords
  • The same word in different sentences may score differently

Scores are probabilistic signals, not definitive judgments. A high threat score doesn't always mean a literal threat — it means the language pattern resembles threatening language with a given probability.