Detecting toxic language in LLM applications requires understanding context, intent, and severity. The toxic language guardrail uses LLM-powered semantic analysis to identify harmful content that simple keyword filtering would miss.
Toxicity manifests in many forms beyond explicit profanity: veiled threats, passive-aggressive hostility, dehumanizing language, and harassment that evades keyword filters. The toxic language guardrail uses LLMs to understand these nuances, detecting both obvious hate speech and subtle harmful communication.

How Toxic Language Detection Works

Understanding the detection process helps you configure the guardrail effectively:

Content submission with configuration

You send text to the toxic language guardrail along with your configuration: sensitivity level and optional model settings. The configuration determines how strict the validation will be.
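For example, a minimal call with just a sensitivity override might look like this sketch (assuming an initialized abv client; any option you omit falls back to the defaults shown under Configuration below):

const result = await abv.guardrails.toxicLanguage.validate(userText, {
  sensitivity: "medium"  // optional; see Configuration for all defaults
});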

LLM semantic analysis

The guardrail sends your text to an LLM specifically instructed to identify harmful content including hate speech, threats, severe insults, harassment, and toxic communication patterns. The LLM analyzes meaning and intent, not just keywords. This semantic understanding catches toxicity in subtle forms: coded language, dog whistles, passive-aggressive hostility, and context-dependent threats that keyword filters miss entirely.
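As an illustrative sketch (the input and expected outcome here are hypothetical), a message containing no profanity or obvious slurs can still be flagged based on its intent:

// No keyword a blocklist would catch, but the veiled threat is detectable semantically
const result = await abv.guardrails.toxicLanguage.validate(
  "People like you always get what's coming to them eventually."
);
// Expect status "fail" or "unsure" rather than "pass" for content like this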

Severity evaluation

The guardrail evaluates content against the configured sensitivity level. Low sensitivity flags only severe violations, medium catches clear hostility, and high sensitivity flags any potentially harmful language including mild rudeness.
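To see the difference in practice, you can validate the same text at each level; the expected outcomes in this sketch are illustrative, not guaranteed:

const text = "That's a stupid question, figure it out yourself.";

for (const sensitivity of ["low", "medium", "high"] as const) {
  const result = await abv.guardrails.toxicLanguage.validate(text, { sensitivity });
  // Illustrative expectation: low → pass (no severe violation),
  // medium → pass or unsure (mild hostility), high → fail (flags mild rudeness)
  console.log(sensitivity, result.status);
}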

Result with confidence and explanation

The guardrail returns a result indicating pass, fail, or unsure. It includes a confidence score and a detailed explanation of what toxicity was detected.

Security note: Log the detailed reason internally for analysis, but never expose it to end users. Detailed feedback helps attackers learn to evade detection.
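A minimal sketch of that boundary, using a helper of your own design: keep the detailed reason in internal logs and return only a generic message to the caller.

async function validateForDisplay(userText: string) {
  const result = await abv.guardrails.toxicLanguage.validate(userText);

  if (result.status !== "pass") {
    // Internal only: detailed reason and confidence for moderation analysis
    console.warn("toxic_language_violation", {
      reason: result.reason,
      confidence: result.confidence
    });
    // User-facing: generic message with no detection details to learn from
    return { ok: false, message: "This message violates our content guidelines." };
  }

  return { ok: true };
}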

Automatic observability

Every validation automatically creates an observation in ABV capturing input, result, configuration, and performance metrics. This data helps you tune sensitivity settings and identify patterns in violations.
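The exact fields depend on your ABV setup; purely as an illustration of the kind of data captured (these field names are assumptions, not the ABV schema), an observation conceptually looks like:

// Illustrative shape only — not the actual ABV observation schema
interface ToxicLanguageObservation {
  input: string;                 // the text that was validated
  result: ToxicLanguageResult;   // status, reason, confidence, sensitivity
  config: ToxicLanguageConfig;   // sensitivity, model, temperature, maxTokens
  latencyMs: number;             // performance metric useful for tuning
  timestamp: string;
}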

Configuration

Configure the toxic language guardrail to match your content moderation needs:
interface ToxicLanguageConfig {
  sensitivity?: SensitivityLevel;  // LOW, MEDIUM, HIGH
  model?: string;                   // LLM model for validation
  temperature?: number;             // Lower = more consistent
  maxTokens?: number;               // Max tokens in response
}

// Default configuration
await abv.guardrails.toxicLanguage.validate(text, {
  sensitivity: "medium",
  model: "anthropic/claude-3-5-haiku-20241022",
  temperature: 0.1,
  maxTokens: 200
});
Option | Type | Default | Description
sensitivity | SensitivityLevel | MEDIUM | Validation strictness (LOW, MEDIUM, HIGH)
model | string | "anthropic/claude-3-5-haiku-20241022" | LLM model for validation
temperature | number | 0.1 | LLM temperature (lower = more consistent)
maxTokens | number | 200 | Maximum tokens in validation response

Sensitivity Levels

Choose the appropriate sensitivity level based on your application context and audience:
Low

Flags only severe violations like explicit threats and extreme hate speech.

Use for: Public forums where disagreement and strong opinions are expected, debate platforms, news comment sections. A minimal example follows this list.

Validation Behavior:
  • Blocks explicit threats of violence
  • Flags extreme hate speech targeting protected groups
  • Allows heated debate and strong criticism
  • Permits profanity in non-threatening contexts
  • Focuses on preventing clearly dangerous content
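As referenced above, a minimal sketch of low sensitivity applied to a debate-style forum (the helper function itself is hypothetical):

async function moderateForumPost(post: string): Promise<boolean> {
  // Low sensitivity: heated debate and profanity pass;
  // explicit threats and extreme hate speech still fail
  const result = await abv.guardrails.toxicLanguage.validate(post, {
    sensitivity: "low"
  });
  return result.status === "pass";
}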

Response Format

The guardrail returns a structured response:
interface ToxicLanguageResult {
  status: "pass" | "fail" | "unsure";
  reason: string;           // Detailed explanation (log internally only)
  confidence: number;       // 0-1 confidence score
  sensitivity: SensitivityLevel;
}

const result = await abv.guardrails.toxicLanguage.validate(userMessage);

if (result.status === "pass") {
  // Content is safe to display
  displayMessage(userMessage);
} else if (result.status === "fail") {
  // Block content, show generic message to user
  showWarning("This message violates community guidelines.");
  // Log details for moderation review
  logViolation(result.reason, result.confidence);
} else {
  // Unsure - queue for human review
  queueForReview(userMessage, result);
}

Common Use Cases

Real-time validation of user messages in chat applications, forums, and community platforms. Screen messages before display to prevent harassment and maintain community standards.
async function moderateMessage(message: string, channelType: string) {
  const sensitivity = channelType === "kids" ? "high" : "medium";

  const result = await abv.guardrails.toxicLanguage.validate(message, {
    sensitivity
  });

  return result.status === "pass";
}
Screen user inputs before sending to your LLM to prevent prompt injection attacks that include toxic content and to avoid processing abusive requests.
async def process_user_request(user_input: str):
    # Check for toxic content before LLM processing
    validation = await abv.guardrails.toxic_language.validate_async(
        user_input,
        {"sensitivity": "medium"}
    )

    if validation["status"] != "pass":
        return {"error": "Please rephrase your request respectfully."}

    # Safe to process
    return await call_llm(user_input)
Validate LLM-generated responses before showing them to users. Even well-trained models can occasionally produce harmful content.
async function generateSafeResponse(prompt: string): Promise<string> {
  const response = await callLLM(prompt);

  const validation = await abv.guardrails.toxicLanguage.validate(response, {
    sensitivity: "high"
  });

  if (validation.status === "fail") {
    // Regenerate with safety instruction
    return await callLLM(
      prompt + "\n\nRespond helpfully and respectfully."
    );
  }

  return response;
}
Screen incoming support tickets to identify abusive messages, prioritize responses, and protect support staff from harassment.
async def screen_support_ticket(ticket_content: str):
    result = await abv.guardrails.toxic_language.validate_async(
        ticket_content,
        {"sensitivity": "medium"}
    )

    if result["status"] == "fail":
        return {
            "flagged": True,
            "route_to": "supervisor",
            "confidence": result["confidence"]
        }

    return {"flagged": False, "route_to": "standard_queue"}
Filter comments on blog posts, product reviews, and other user-generated content to maintain a positive community environment.
async function filterComment(comment: string, contentType: string) {
  const result = await abv.guardrails.toxicLanguage.validate(comment, {
    sensitivity: contentType === "product_review" ? "medium" : "high"
  });

  if (result.status === "pass") {
    return { approved: true };
  }

  if (result.confidence > 0.9) {
    return { approved: false, reason: "Comment violates guidelines" };
  }

  // Low confidence - queue for moderation
  return { approved: false, needsReview: true };
}

Combining with Other Guardrails

Toxic language detection often works alongside other content validation:
  • With biased language detection: Validate for both toxicity and bias to ensure content is both safe and inclusive. Run these checks in parallel since they analyze different aspects of the same content (see the sketch below).
  • With contains string: Use contains string to quickly block explicitly prohibited terms (instant, free) before running toxic language detection (1-3 seconds, token cost). This pre-filtering reduces costs.
  • With valid JSON: When processing structured input, validate JSON format first (instant), then check text fields for toxicity.
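As referenced above, a sketch of running two semantic checks in parallel with Promise.all (the biasedLanguage accessor name is an assumption here; check that guardrail's page for the exact API):

async function checkContent(text: string): Promise<boolean> {
  // The checks analyze different aspects of the same content,
  // so neither needs to wait for the other
  const [toxicity, bias] = await Promise.all([
    abv.guardrails.toxicLanguage.validate(text, { sensitivity: "medium" }),
    abv.guardrails.biasedLanguage.validate(text)  // accessor name assumed
  ]);

  return toxicity.status === "pass" && bias.status === "pass";
}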

Next Steps