# Toxicity Detection
How Protecto analyzes text for harmful content and returns structured safety scores — without modifying or blocking data.
Toxicity detection analyzes text for harmful, abusive, or unsafe content. It is designed to support AI safety, moderation, and compliance workflows without altering the underlying data.
Toxicity detection is additive metadata — it adds context without changing behavior.
## When toxicity detection runs
Toxicity detection can run during:
- Masking — when text is analyzed and sensitive data is tokenized
- Unmasking — when original values are resolved from tokens
Whether toxicity detection runs is entirely policy-controlled. If the policy does not enable toxicity detection, no toxicity data is returned.
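As an illustration, a policy that enables toxicity detection might carry a flag like the one below. This is a minimal sketch: the field names (`toxicity_detection`, `enabled`, `mask`) are assumptions for illustration, not Protecto's actual policy schema.

```python
# Hypothetical policy payload; the field names here are illustrative
# assumptions, not Protecto's actual policy schema.
policy = {
    "name": "support-chat-masking",
    "mask": {"entities": ["PERSON", "EMAIL"]},
    "toxicity_detection": {"enabled": True},
}

def toxicity_enabled(policy: dict) -> bool:
    """Return True if this policy turns toxicity detection on."""
    return bool(policy.get("toxicity_detection", {}).get("enabled", False))
```

If the flag is absent or false, no toxicity scores appear in the response, which matches the policy-controlled behavior described above.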
## Toxicity categories
Toxicity detection evaluates content across six categories. Each is returned as a score between 0 and 1, where a higher value indicates a higher likelihood that the category applies.
| Category | Description |
|---|---|
| `toxicity` | Overall likelihood that the content is toxic |
| `severe_toxicity` | Likelihood of extreme or highly harmful content |
| `obscene` | Presence of obscene or explicit language |
| `threat` | Likelihood of violent or threatening language |
| `insult` | Likelihood of insulting or abusive language |
| `identity_attack` | Attacks targeting a protected group |
These scores are probabilistic signals, not binary judgments.
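Because the scores are probabilities rather than verdicts, a common pattern is to apply a caller-chosen threshold. A minimal sketch, assuming the six categories arrive as a simple category-to-score mapping (the exact response shape is an assumption):

```python
# Example scores for one piece of text; the values are illustrative.
scores = {
    "toxicity": 0.82,
    "severe_toxicity": 0.11,
    "obscene": 0.05,
    "threat": 0.03,
    "insult": 0.64,
    "identity_attack": 0.02,
}

def flagged_categories(scores: dict, threshold: float = 0.5) -> list:
    """Return the categories whose score meets or exceeds the threshold."""
    return sorted(cat for cat, score in scores.items() if score >= threshold)

flagged_categories(scores)  # -> ["insult", "toxicity"]
```

The threshold is an application decision: a stricter workflow might flag at 0.3, a lenient one at 0.8.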
## How toxicity data is returned
When enabled, toxicity detection results are returned as part of the API response, alongside masked or unmasked data.
Key characteristics:
- Scores do not affect masking or unmasking behavior
- No content is modified or blocked by Protecto
- Scores are returned per request
- Nothing is stored automatically
Your application decides how to use the scores.
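Since Protecto never enforces an outcome, the caller maps scores to actions itself. A sketch of such application-side logic, with illustrative thresholds (the "block" decision is entirely the caller's, never Protecto's):

```python
def moderation_action(scores: dict,
                      review_at: float = 0.5,
                      block_at: float = 0.9) -> str:
    """Map toxicity scores to an application-level action.

    Thresholds are illustrative assumptions; Protecto itself never
    blocks content, so any "block" outcome is decided by the caller.
    """
    worst = max(scores.values(), default=0.0)
    if worst >= block_at:
        return "block"
    if worst >= review_at:
        return "human_review"
    return "allow"
```

Keeping this logic in your own code means moderation policy can change without touching the masking pipeline.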
## Common use cases
Toxicity detection is typically used for:
- Moderating AI-generated responses
- Flagging unsafe user input before it enters a workflow
- Auditing LLM interactions for safety compliance
- Applying escalation or human review workflows
- Logging content safety signals alongside masked data
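For the last use case above, an application might pair the masked text with its scores in an audit log. A minimal sketch; the record shape is an assumption, and persistence is the application's job since Protecto stores nothing automatically:

```python
import json
import time

def safety_log_record(masked_text: str, scores: dict) -> str:
    """Build a JSON audit-log line pairing masked text with safety scores.

    The record shape here is illustrative; Protecto does not store
    anything automatically, so writing this record is up to the caller.
    """
    record = {
        "ts": time.time(),
        "masked_text": masked_text,
        "toxicity_scores": scores,
    }
    return json.dumps(record, sort_keys=True)
```

Logging the masked form rather than the original keeps sensitive data out of the audit trail while preserving the safety signal.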
It is especially useful when working with GenAI systems where sensitive data and unsafe language may coexist in the same payload.
## What toxicity detection does not do
Toxicity detection does not:
- Mask or redact content
- Prevent unmasking
- Enforce moderation decisions
- Replace application logic
It provides signals. You control outcomes.
Mental model: Think of toxicity detection as "a safety signal that travels alongside your data." It adds context without changing behavior — Protecto reports, your system decides.