Detecting toxic language in LLM applications requires understanding context, intent, and severity. The toxic language guardrail uses LLM-powered semantic analysis to identify harmful content that simple keyword filtering would miss.
Toxicity manifests in many forms beyond explicit profanity: veiled threats, passive-aggressive hostility, dehumanizing language, and harassment that evades keyword filters. The toxic language guardrail uses LLMs to understand these nuances, detecting both obvious hate speech and subtle harmful communication.

How Toxic Language Detection Works

Understanding the detection process helps you configure the guardrail effectively:

Content submission with configuration

You send text to the toxic language guardrail along with your configuration: sensitivity level and optional model settings. The configuration determines how strict the validation will be.
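For example, a minimal call with just a sensitivity override might look like this sketch (assuming an initialized abv client; any option you omit falls back to the defaults shown under Configuration below):

const result = await abv.guardrails.toxicLanguage.validate(userText, {
  sensitivity: "medium"  // optional; see Configuration for all defaults
});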

LLM semantic analysis

The guardrail sends your text to an LLM specifically instructed to identify harmful content including hate speech, threats, severe insults, harassment, and toxic communication patterns. The LLM analyzes meaning and intent, not just keywords. This semantic understanding catches toxicity in subtle forms: coded language, dog whistles, passive-aggressive hostility, and context-dependent threats that keyword filters miss entirely.
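As an illustrative sketch (the input and expected outcome here are hypothetical), a message containing no profanity or obvious slurs can still be flagged based on its intent:

// No keyword a blocklist would catch, but the veiled threat is detectable semantically
const result = await abv.guardrails.toxicLanguage.validate(
  "People like you always get what's coming to them eventually."
);
// Expect status "fail" or "unsure" rather than "pass" for content like this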

Severity evaluation

The guardrail evaluates content against the configured sensitivity level. Low sensitivity flags only severe violations, medium catches clear hostility, and high sensitivity flags any potentially harmful language including mild rudeness.
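To see the difference in practice, you can validate the same text at each level; the expected outcomes in this sketch are illustrative, not guaranteed:

const text = "That's a stupid question, figure it out yourself.";

for (const sensitivity of ["low", "medium", "high"] as const) {
  const result = await abv.guardrails.toxicLanguage.validate(text, { sensitivity });
  // Illustrative expectation: low → pass (no severe violation),
  // medium → pass or unsure (mild hostility), high → fail (flags mild rudeness)
  console.log(sensitivity, result.status);
}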

Result with confidence and explanation

The guardrail returns a result indicating pass, fail, or unsure. It includes a confidence score and a detailed explanation of what toxicity was detected.

Security note: Log the detailed reason internally for analysis, but never expose it to end users. Detailed feedback helps attackers learn to evade detection.
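A minimal sketch of that boundary, using a helper of your own design: keep the detailed reason in internal logs and return only a generic message to the caller.

async function validateForDisplay(userText: string) {
  const result = await abv.guardrails.toxicLanguage.validate(userText);

  if (result.status !== "pass") {
    // Internal only: detailed reason and confidence for moderation analysis
    console.warn("toxic_language_violation", {
      reason: result.reason,
      confidence: result.confidence
    });
    // User-facing: generic message with no detection details to learn from
    return { ok: false, message: "This message violates our content guidelines." };
  }

  return { ok: true };
}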

Automatic observability

Every validation automatically creates an observation in ABV capturing input, result, configuration, and performance metrics. This data helps you tune sensitivity settings and identify patterns in violations.
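The exact fields depend on your ABV setup; purely as an illustration of the kind of data captured (these field names are assumptions, not the ABV schema), an observation conceptually looks like:

// Illustrative shape only — not the actual ABV observation schema
interface ToxicLanguageObservation {
  input: string;                 // the text that was validated
  result: ToxicLanguageResult;   // status, reason, confidence, sensitivity
  config: ToxicLanguageConfig;   // sensitivity, model, temperature, maxTokens
  latencyMs: number;             // performance metric useful for tuning
  timestamp: string;
}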

Configuration

Configure the toxic language guardrail to match your content moderation needs:
interface ToxicLanguageConfig {
  sensitivity?: SensitivityLevel;  // LOW, MEDIUM, HIGH
  model?: string;                   // LLM model for validation
  temperature?: number;             // Lower = more consistent
  maxTokens?: number;               // Max tokens in response
}

// Default configuration
await abv.guardrails.toxicLanguage.validate(text, {
  sensitivity: "medium",
  model: "anthropic/claude-3-5-haiku-20241022",
  temperature: 0.1,
  maxTokens: 200
});
Option | Type | Default | Description
sensitivity | SensitivityLevel | MEDIUM | Validation strictness (LOW, MEDIUM, HIGH)
model | string | "anthropic/claude-3-5-haiku-20241022" | LLM model for validation
temperature | number | 0.1 | LLM temperature (lower = more consistent)
maxTokens | number | 200 | Maximum tokens in validation response

Sensitivity Levels

Choose the appropriate sensitivity level based on your application context and audience:
Low

Flags only severe violations like explicit threats and extreme hate speech.

Use for: Public forums where disagreement and strong opinions are expected, debate platforms, news comment sections. A minimal example follows this list.

Validation Behavior:
  • Blocks explicit threats of violence
  • Flags extreme hate speech targeting protected groups
  • Allows heated debate and strong criticism
  • Permits profanity in non-threatening contexts
  • Focuses on preventing clearly dangerous content
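As referenced above, a minimal sketch of low sensitivity applied to a debate-style forum (the helper function itself is hypothetical):

async function moderateForumPost(post: string): Promise<boolean> {
  // Low sensitivity: heated debate and profanity pass;
  // explicit threats and extreme hate speech still fail
  const result = await abv.guardrails.toxicLanguage.validate(post, {
    sensitivity: "low"
  });
  return result.status === "pass";
}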

Response Format

The guardrail returns a structured response:
interface ToxicLanguageResult {
  status: "pass" | "fail" | "unsure";
  reason: string;           // Detailed explanation (log internally only)
  confidence: number;       // 0-1 confidence score
  sensitivity: SensitivityLevel;
}

const result = await abv.guardrails.toxicLanguage.validate(userMessage);

if (result.status === "pass") {
  // Content is safe to display
  displayMessage(userMessage);
} else if (result.status === "fail") {
  // Block content, show generic message to user
  showWarning("This message violates community guidelines.");
  // Log details for moderation review
  logViolation(result.reason, result.confidence);
} else {
  // Unsure - queue for human review
  queueForReview(userMessage, result);
}

Common Use Cases

Real-time validation of user messages in chat applications, forums, and community platforms. Screen messages before display to prevent harassment and maintain community standards.
async function moderateMessage(message: string, channelType: string) {
  const sensitivity = channelType === "kids" ? "high" : "medium";

  const result = await abv.guardrails.toxicLanguage.validate(message, {
    sensitivity
  });

  return result.status === "pass";
}
Screen user inputs before sending to your LLM to prevent prompt injection attacks that include toxic content and to avoid processing abusive requests.
async def process_user_request(user_input: str):
    # Check for toxic content before LLM processing
    validation = await abv.guardrails.toxic_language.validate_async(
        user_input,
        {"sensitivity": "medium"}
    )

    if validation["status"] != "pass":
        return {"error": "Please rephrase your request respectfully."}

    # Safe to process
    return await call_llm(user_input)
Validate LLM-generated responses before showing them to users. Even well-trained models can occasionally produce harmful content.
async function generateSafeResponse(prompt: string): Promise<string> {
  const response = await callLLM(prompt);

  const validation = await abv.guardrails.toxicLanguage.validate(response, {
    sensitivity: "high"
  });

  if (validation.status === "fail") {
    // Regenerate with safety instruction
    return await callLLM(
      prompt + "\n\nRespond helpfully and respectfully."
    );
  }

  return response;
}
Screen incoming support tickets to identify abusive messages, prioritize responses, and protect support staff from harassment.
async def screen_support_ticket(ticket_content: str):
    result = await abv.guardrails.toxic_language.validate_async(
        ticket_content,
        {"sensitivity": "medium"}
    )

    if result["status"] == "fail":
        return {
            "flagged": True,
            "route_to": "supervisor",
            "confidence": result["confidence"]
        }

    return {"flagged": False, "route_to": "standard_queue"}
Filter comments on blog posts, product reviews, and other user-generated content to maintain a positive community environment.
async function filterComment(comment: string, contentType: string) {
  const result = await abv.guardrails.toxicLanguage.validate(comment, {
    sensitivity: contentType === "product_review" ? "medium" : "high"
  });

  if (result.status === "pass") {
    return { approved: true };
  }

  if (result.confidence > 0.9) {
    return { approved: false, reason: "Comment violates guidelines" };
  }

  // Low confidence - queue for moderation
  return { approved: false, needsReview: true };
}

Combining with Other Guardrails

Toxic language detection often works alongside other content validation:
  • With biased language detection: Validate for both toxicity and bias to ensure content is both safe and inclusive. Run these checks in parallel since they analyze different aspects of the same content (see the sketch below).
  • With contains string: Use contains string to quickly block explicitly prohibited terms (instant, free) before running toxic language detection (1-3 seconds, token cost). This pre-filtering reduces costs.
  • With valid JSON: When processing structured input, validate JSON format first (instant), then check text fields for toxicity.
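As referenced above, a sketch of running two semantic checks in parallel with Promise.all (the biasedLanguage accessor name is an assumption here; check that guardrail's page for the exact API):

async function checkContent(text: string): Promise<boolean> {
  // The checks analyze different aspects of the same content,
  // so neither needs to wait for the other
  const [toxicity, bias] = await Promise.all([
    abv.guardrails.toxicLanguage.validate(text, { sensitivity: "medium" }),
    abv.guardrails.biasedLanguage.validate(text)  // accessor name assumed
  ]);

  return toxicity.status === "pass" && bias.status === "pass";
}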

Next Steps