- Context-Aware Detection
- Content Safety
- Configurable Sensitivity
Toxicity manifests in many forms beyond explicit profanity: veiled threats, passive-aggressive hostility, dehumanizing language, and harassment that evades keyword filters. The toxic language guardrail uses LLMs to understand these nuances, detecting both obvious hate speech and subtle harmful communication.
How Toxic Language Detection Works
Understanding the detection process helps you configure the guardrail effectively:
Content submission with configuration
You send text to the toxic language guardrail along with your configuration: sensitivity level and optional model settings. The configuration determines how strict the validation will be.
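As a rough illustration, a submission might look like the TypeScript sketch below. The `@abv/guardrails` package name, the `createToxicLanguageGuardrail` factory, and the `validate` method are assumed names for illustration, not taken from an SDK reference.

```typescript
// Minimal submission sketch; package name, factory, and method names are assumed.
import { createToxicLanguageGuardrail, SensitivityLevel } from "@abv/guardrails";

const guardrail = createToxicLanguageGuardrail({
  sensitivity: SensitivityLevel.MEDIUM, // how strict the validation should be
});

// Submit the text to validate; the guardrail performs the LLM call internally.
const result = await guardrail.validate("You people are worthless and should disappear.");
```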
LLM semantic analysis
The guardrail sends your text to an LLM specifically instructed to identify harmful content including hate speech, threats, severe insults, harassment, and toxic communication patterns. The LLM analyzes meaning and intent, not just keywords.
This semantic understanding catches toxicity in subtle forms: coded language, dog whistles, passive-aggressive hostility, and context-dependent threats that keyword filters miss entirely.
Severity evaluation
The guardrail evaluates content against the configured sensitivity level. Low sensitivity flags only severe violations, medium catches clear hostility, and high sensitivity flags any potentially harmful language including mild rudeness.
Result with confidence and explanation
The guardrail returns a result indicating pass, fail, or unsure. It includes a confidence score and detailed explanation of what toxicity was detected.
Security note: Log the detailed reason internally for analysis, but never expose it to end users. Detailed feedback helps attackers learn to evade detection.
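Continuing the submission sketch above, result handling could look like the following; the `decision`, `confidence`, and `reason` field names are assumptions that mirror the pass/fail/unsure outcome, confidence score, and explanation described here.

```typescript
// Handling the result from the submission sketch; field names are assumed for illustration.
function handleResult(result: { decision: "pass" | "fail" | "unsure"; confidence: number; reason: string }) {
  if (result.decision !== "pass") {
    // Keep the detailed explanation in internal logs only.
    console.warn("Toxic language guardrail flagged content", {
      decision: result.decision,
      confidence: result.confidence,
      reason: result.reason, // never expose this to the end user
    });
    // The user sees a generic message that reveals nothing about the detection.
    return { ok: false, message: "Your message could not be posted." };
  }
  return { ok: true };
}

// e.g. handleResult(result) with the result returned by the submission sketch above.
```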
Automatic observability
Every validation automatically creates an observation in ABV capturing input, result, configuration, and performance metrics. This data helps you tune sensitivity settings and identify patterns in violations.
Configuration
Configure the toxic language guardrail to match your content moderation needs.
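The TypeScript sketch below shows one way the configuration might look; the import path and factory name are assumed for illustration, while the option names and defaults come from the table that follows.

```typescript
// Configuration sketch; import path and factory name are assumed.
// Option names and defaults follow the options table below.
import { createToxicLanguageGuardrail, SensitivityLevel } from "@abv/guardrails";

const guardrail = createToxicLanguageGuardrail({
  sensitivity: SensitivityLevel.MEDIUM,         // LOW, MEDIUM, or HIGH
  model: "anthropic/claude-3-5-haiku-20241022", // LLM model used for validation
  temperature: 0.1,                             // lower values give more consistent verdicts
  maxTokens: 200,                               // cap on the validation response
});
```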
| Option | Type | Default | Description |
|---|---|---|---|
| sensitivity | SensitivityLevel | MEDIUM | Validation strictness (LOW, MEDIUM, HIGH) |
| model | string | "anthropic/claude-3-5-haiku-20241022" | LLM model for validation |
| temperature | number | 0.1 | LLM temperature (lower = more consistent) |
| maxTokens | number | 200 | Maximum tokens in validation response |
Sensitivity Levels
Choose the appropriate sensitivity level based on your application context and audience: LOW, MEDIUM (the default), or HIGH.
LOW
Flags only severe violations like explicit threats and extreme hate speech.
Use for: Public forums where disagreement and strong opinions are expected, debate platforms, and news comment sections (a brief sketch follows the list below).
Validation Behavior:
- Blocks explicit threats of violence
- Flags extreme hate speech targeting protected groups
- Allows heated debate and strong criticism
- Permits profanity in non-threatening contexts
- Focuses on preventing clearly dangerous content
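As a rough illustration of the LOW level, the sketch below reuses the same assumed API as earlier and contrasts content you would expect to pass with content that should still fail.

```typescript
// Same assumed API as above; comments describe expected outcomes at LOW sensitivity.
import { createToxicLanguageGuardrail, SensitivityLevel } from "@abv/guardrails";

const forumGuardrail = createToxicLanguageGuardrail({ sensitivity: SensitivityLevel.LOW });

// Heated criticism: expected to pass at LOW sensitivity.
const debate = await forumGuardrail.validate(
  "This proposal is a disaster and whoever wrote it should be embarrassed."
);

// Explicit threat: expected to fail at every sensitivity level.
const threat = await forumGuardrail.validate(
  "Show up at the meetup tomorrow and I will hurt you."
);
```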
Response Format
The guardrail returns a structured response.
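The exact schema isn't reproduced here, so the TypeScript shape below is an assumption that matches the fields described earlier: a pass/fail/unsure decision, a confidence score, and an explanation.

```typescript
// Assumed response shape for illustration; actual field names may differ.
interface ToxicLanguageResult {
  decision: "pass" | "fail" | "unsure"; // overall verdict
  confidence: number;                   // how certain the LLM-based check is
  reason: string;                       // detailed explanation; log internally, never show to users
}

// Example of what a failed validation might look like:
const flagged: ToxicLanguageResult = {
  decision: "fail",
  confidence: 0.92,
  reason: "Message contains a veiled threat directed at another user.",
};
```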
Common Use Cases
Content Moderation in Forums and Chat
Real-time validation of user messages in chat applications, forums, and community platforms. Screen messages before display to prevent harassment and maintain community standards.
User Input Validation Before LLM Processing
Screen user inputs before sending to your LLM to prevent prompt injection attacks that include toxic content and to avoid processing abusive requests.
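A sketch of this pattern, reusing the assumed guardrail instance from the configuration section and a hypothetical `callLLM` helper: validate the input first and only call the model when it passes.

```typescript
// Pre-LLM input screening sketch; callLLM is an assumed stand-in for your model call,
// and guardrail is the instance from the configuration sketch above.
declare function callLLM(prompt: string): Promise<string>;

async function answerUser(userInput: string): Promise<string> {
  const check = await guardrail.validate(userInput);
  if (check.decision !== "pass") {
    return "Sorry, I can't help with that request."; // generic refusal, no detection details
  }
  return callLLM(userInput); // only validated input reaches the model
}
```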
AI Response Validation Before Delivery
Validate LLM-generated responses before showing them to users. Even well-trained models can occasionally produce harmful content.
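The mirror-image sketch for output validation, again with the assumed API: check the model's draft and fall back to a safe message if it is flagged.

```typescript
// Post-LLM output validation sketch; same assumed guardrail instance as above.
async function deliverResponse(draft: string): Promise<string> {
  const check = await guardrail.validate(draft);
  if (check.decision === "pass") {
    return draft;
  }
  // Log the explanation internally; the user only sees a safe fallback.
  console.warn("LLM draft flagged by toxic language guardrail", { reason: check.reason });
  return "I'm unable to provide that response. Please try rephrasing your request.";
}
```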
Support Ticket Screening
Screen incoming support tickets to identify abusive messages, prioritize responses, and protect support staff from harassment.
Comment Filtering on User-Generated Content
Filter comments on blog posts, product reviews, and other user-generated content to maintain a positive community environment.
Combining with Other Guardrails
Toxic language detection often works alongside other content validation (see the sketch after this list):
- With biased language detection: Validate for both toxicity and bias to ensure content is both safe and inclusive. Run these checks in parallel since they analyze different aspects of the same content.
- With contains string: Use contains string to quickly block explicitly prohibited terms (instant, free) before running toxic language detection (1-3 seconds, token cost). This pre-filtering reduces costs.
- With valid JSON: When processing structured input, validate JSON format first (instant), then check text fields for toxicity.
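A sketch of that layering, with a hypothetical `containsProhibitedTerms` helper and `biasGuardrail` instance standing in for the contains string and biased language guardrails: the cheap string check runs first, and the two LLM-based checks run in parallel only if it passes.

```typescript
// Layered validation sketch; containsProhibitedTerms and biasGuardrail are assumed
// stand-ins, and guardrail is the toxic language instance from the configuration sketch.
const PROHIBITED_TERMS = ["exampleSlur1", "exampleSlur2"]; // hypothetical blocklist
const containsProhibitedTerms = (text: string) =>
  PROHIBITED_TERMS.some((term) => text.toLowerCase().includes(term.toLowerCase()));

declare const biasGuardrail: { validate(text: string): Promise<{ decision: string }> };

async function moderate(text: string): Promise<{ allowed: boolean }> {
  // 1. Instant, free pre-filter for explicitly prohibited terms.
  if (containsProhibitedTerms(text)) {
    return { allowed: false };
  }
  // 2. LLM-based checks (seconds, token cost) run in parallel since they
  //    analyze different aspects of the same content.
  const [toxicity, bias] = await Promise.all([
    guardrail.validate(text),
    biasGuardrail.validate(text),
  ]);
  return { allowed: toxicity.decision === "pass" && bias.decision === "pass" };
}
```

Running the two LLM-based checks with Promise.all keeps the added latency close to that of a single check.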
Next Steps
Biased Language
Detect discriminatory content and stereotypes, complementing toxicity detection
Best Practices
Learn optimal patterns for combining guardrails, error handling, and cost management
Concepts
Understand confidence scores, decision strategies, and validation patterns in depth
Quickstart
Get hands-on with guardrails in under 5 minutes