Toxicity Detection

How Protecto analyzes text for harmful content and returns structured safety scores — without modifying or blocking data.

Toxicity detection analyzes text for harmful, abusive, or unsafe content. It is designed to support AI safety, moderation, and compliance workflows without altering the underlying data.

Toxicity detection is additive metadata — it adds context without changing behavior.

When toxicity detection runs

Toxicity detection can run during:

  • Masking — when text is analyzed and sensitive data is tokenized
  • Unmasking — when original values are resolved from tokens

Whether toxicity detection runs is entirely policy-controlled. If the policy does not enable toxicity detection, no toxicity data is returned.

Toxicity categories

Toxicity detection evaluates content across six categories. Each category is returned as a score between 0 and 1, where higher values indicate a greater likelihood that the category applies.

Category          Description
toxicity          Overall likelihood that the content is toxic
severe_toxicity   Likelihood of extreme or highly harmful content
obscene           Presence of obscene or explicit language
threat            Likelihood of violent or threatening language
insult            Likelihood of insulting or abusive language
identity_attack   Attacks targeting a protected group

These scores are probabilistic signals, not binary judgments.
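To make the probabilistic framing concrete, the sketch below shows a score block with the six categories and a small helper that lists which categories exceed a chosen threshold. The field names match the table above; the surrounding values and the `flagged_categories` helper are illustrative assumptions, not part of Protecto's API.

```python
# Hypothetical toxicity score block; category names mirror the table above,
# but the values and overall response shape are assumptions for illustration.
scores = {
    "toxicity": 0.82,
    "severe_toxicity": 0.12,
    "obscene": 0.05,
    "threat": 0.03,
    "insult": 0.67,
    "identity_attack": 0.01,
}

def flagged_categories(scores: dict, threshold: float = 0.5) -> list:
    """Return the categories whose score meets or exceeds the threshold.

    The threshold is an application choice -- the scores themselves are
    probabilities, not verdicts.
    """
    return [name for name, score in scores.items() if score >= threshold]

print(flagged_categories(scores))  # ['toxicity', 'insult']
```

Because the scores are continuous, the same response can be read strictly or leniently simply by moving the threshold, without re-running detection.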

How toxicity data is returned

When enabled, toxicity detection results are returned as part of the API response, alongside masked or unmasked data.

Key characteristics:

  • Scores do not affect masking or unmasking behavior
  • No content is modified or blocked by Protecto
  • Scores are returned per request
  • Nothing is stored automatically

Your application decides how to use the scores.
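As a sketch of that application-side decision, the function below maps a score block to one of three actions using two thresholds. The action names and threshold values are illustrative assumptions; your workflow defines its own.

```python
def route_content(scores: dict,
                  block_threshold: float = 0.9,
                  review_threshold: float = 0.5) -> str:
    """Map toxicity scores to an application-level action.

    Protecto only reports the scores; this routing logic -- including the
    threshold values and action names -- lives entirely in your application.
    """
    worst = max(scores.values(), default=0.0)
    if worst >= block_threshold:
        return "block"
    if worst >= review_threshold:
        return "human_review"
    return "allow"

print(route_content({"toxicity": 0.95, "insult": 0.40}))  # block
```

Keeping the decision in your code means moderation policy can change per product surface (for example, stricter thresholds for public-facing output than for internal logs) without any change to the detection step.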

Common use cases

Toxicity detection is typically used for:

  • Moderating AI-generated responses
  • Flagging unsafe user input before it enters a workflow
  • Auditing LLM interactions for safety compliance
  • Applying escalation or human review workflows
  • Logging content safety signals alongside masked data

It is especially useful when working with GenAI systems where sensitive data and unsafe language may coexist in the same payload.
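For the logging use case, one common pattern is to pair the masked output with its scores in a single audit record. The sketch below assumes a `request_id`, the masked text, and a score dictionary are already in hand; the record shape and the `needs_review` field are assumptions for illustration.

```python
import datetime

def safety_log_record(request_id: str, masked_text: str,
                      scores: dict, threshold: float = 0.5) -> dict:
    """Build a structured audit record pairing masked output with its
    toxicity scores.

    Protecto stores nothing automatically, so persisting records like this
    is the application's responsibility. Field names are illustrative.
    """
    return {
        "request_id": request_id,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "masked_text": masked_text,
        "toxicity_scores": scores,
        "needs_review": any(s >= threshold for s in scores.values()),
    }

record = safety_log_record("req-001", "Contact <PER_1> about the refund",
                           {"toxicity": 0.72, "threat": 0.04})
print(record["needs_review"])  # True
```

Because the record carries only masked text plus numeric scores, it can be retained for compliance review without re-exposing the original sensitive values.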

What toxicity detection does not do

Toxicity detection does not:

  • Mask or redact content
  • Prevent unmasking
  • Enforce moderation decisions
  • Replace application logic

It provides signals. You control outcomes.

Mental model: Think of toxicity detection as "a safety signal that travels alongside your data." It adds context without changing behavior — Protecto reports, your system decides.