Understanding a few fundamental concepts will help you use guardrails effectively. This guide explains how guardrails work, what their results mean, and how to make decisions based on those results.

How Guardrails Make Decisions

Guardrails fall into two categories based on how they analyze content:
LLM-powered guardrails use language models to understand context, nuance, and intent. When you send text to an LLM-powered guardrail like toxic language or biased language, it asks a language model to analyze the content and make a judgment. This takes about 1-3 seconds but gives you sophisticated analysis that understands sarcasm, coded language, and cultural context. Example: “People like you are the problem” → the LLM recognizes this as hostile even without explicit profanity.
Rule-based guardrails apply deterministic logic such as string matching or JSON validation. They run locally in milliseconds, cost nothing, and always return a binary pass/fail result.
Key difference: LLM-powered guardrails understand meaning while rule-based guardrails match patterns. If someone writes “people like you are the problem,” an LLM-powered guardrail recognizes this as hostile even though it doesn’t contain explicit profanity. A rule-based guardrail would only catch it if you explicitly listed that exact phrase.
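To make the difference concrete, here is a minimal sketch that runs the same phrase through both kinds of checks, using the abv.guardrails calls that appear later in this guide. The listed strings and the outputs in the comments are illustrative assumptions, not guaranteed results.
// The same phrase through a rule-based check and an LLM-powered check.
// Assumes an initialized `abv` client, as in the examples later in this guide.
const content = "People like you are the problem";

// Rule-based: only matches the exact strings you list, so this passes.
const ruleCheck = await abv.guardrails.containsString.validate(content, {
  strings: ["idiot", "hate you"],
  mode: "none"
});
console.log(ruleCheck.status);     // "pass" - no listed string appears

// LLM-powered: understands intent, so this is likely to fail despite no profanity.
const llmCheck = await abv.guardrails.toxicLanguage.validate(content);
console.log(llmCheck.status);      // e.g. "fail"
console.log(llmCheck.confidence);  // variable, e.g. 0.7-0.9
console.log(llmCheck.reason);      // human-readable explanation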

Understanding Results

Every guardrail returns a result with three essential pieces of information:
The status field tells you the outcome of validation:
  • pass - Content meets your validation criteria and is safe to use
  • fail - Content violates your criteria and should be blocked or regenerated
  • unsure - The guardrail cannot make a confident determination (LLM-powered only)
The unsure status only appears with LLM-powered guardrails since rule-based guardrails make binary decisions. You’ll see unsure when content is genuinely ambiguous or sits right on the boundary between acceptable and unacceptable.
Examples of unsure cases:
  • Mild sarcasm that’s hard to judge definitively
  • Comments that could be critical feedback or personal attacks
  • Context-dependent language without enough context
The confidence field indicates how certain the guardrail is about its decision.
Rule-based guardrails always return 1.0 because their logic is deterministic: there is no uncertainty when checking whether a string contains another string or whether JSON is valid.
LLM-powered guardrails return variable confidence:
  • 0.9-1.0 (Very high) - Clear, unambiguous indicators
  • 0.7-0.9 (High) - Strong evidence with minor ambiguity
  • 0.5-0.7 (Moderate) - Notable ambiguity in the case
  • < 0.5 (Low) - Borderline, difficult to judge definitively
Factors affecting confidence:
  • Content with obvious violations or clear acceptability → High confidence
  • Ambiguous phrasing, sarcasm, context-dependent meaning → Lower confidence
  • Very short text with little context → Lower confidence
  • Cultural or linguistic nuances → May reduce confidence
The reason field provides a human-readable explanation of why the guardrail made its decision.
Use cases:
  • Internal logging and debugging
  • Analyzing patterns in your dashboard
  • Understanding edge cases
  • Tuning your configuration
Security note: Never expose the detailed reason to end users. This information helps attackers understand your validation logic so they can evade it. Use generic error messages for users while logging detailed reasons internally.
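Putting the three fields together, here is a minimal sketch of handling a result while keeping the detailed reason internal. The result shape follows the fields described above; logger is a placeholder for whatever internal logging you use.
// Log full details internally, return only a generic message to the user.
// `logger` is a placeholder for your own internal logging.
function respondToUser(result, content) {
  logger.info("guardrail result", {
    status: result.status,
    confidence: result.confidence,
    reason: result.reason   // never sent to the end user
  });

  if (result.status === "pass") {
    return { content };
  }

  // Generic message only; the detailed reason would help attackers probe
  // your validation logic.
  return { error: "Your message could not be processed." };
}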

How Guardrails Process Content

Key differences:
  • Rule-based guardrails always return binary results (PASS/FAIL) with confidence 1.0
  • LLM-powered guardrails can return UNSURE status with variable confidence scores
  • Your application decides how to handle each result based on status and confidence

Common Usage Patterns

Understanding common patterns helps you apply guardrails effectively in your application:
Run guardrails before sending user content to your LLM. This protects your LLM from toxic prompts, prevents prompt injection attacks, and ensures you only process valid requests. Common pattern: check for forbidden strings first (instant), then run LLM-powered toxicity detection only if the content passes the initial screening.
Run guardrails after your LLM generates content but before showing it to users. This maintains brand safety, ensures compliance with regulations, and catches cases where the LLM generates unexpected content. Essential for customer-facing applications and regulated industries.
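A minimal sketch of both patterns together, assuming an initialized abv client and a hypothetical generateReply function standing in for your own LLM call:
// Validate user input before the LLM call and the LLM output after it.
// `generateReply` is a hypothetical stand-in for your own LLM call.
async function answerUser(userMessage) {
  // Input validation: block toxic or injected prompts before spending tokens.
  const inputCheck = await abv.guardrails.toxicLanguage.validate(userMessage);
  if (inputCheck.status !== "pass") {
    return { error: "Your message could not be processed." };
  }

  const reply = await generateReply(userMessage);

  // Output validation: keep unexpected model output away from users.
  const outputCheck = await abv.guardrails.toxicLanguage.validate(reply);
  if (outputCheck.status !== "pass") {
    return { error: "We could not generate a response. Please try again." };
  }

  return { reply };
}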
Use multiple guardrails in sequence for comprehensive protection. Start with fast rule-based checks to catch obvious problems, then run expensive LLM-powered checks only if content passes initial screening. This pattern minimizes cost while maintaining thorough validation.
Run multiple independent guardrails simultaneously to minimize latency. For example, checking for toxic language and biased language are independent analyses that can happen in parallel. Total time equals the slowest check, not the sum of all checks.

Sensitivity Levels

LLM-powered guardrails support sensitivity settings that control validation strictness. Choosing the appropriate level depends on your system’s risk classification under regulatory frameworks and the potential harm from content failures.
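For example, a permissive configuration might look like the sketch below. The sensitivity option name is an assumption based on the configuration fields captured in observations, not a documented signature, so check your SDK reference for the exact parameter.
// Sketch only: `sensitivity` is an assumed option name, not a confirmed API.
const result = await abv.guardrails.toxicLanguage.validate(content, {
  sensitivity: "low"  // permissive: only severe violations trigger failures
});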

Regulatory Framework Mapping

Permissive validation: only severe violations trigger failures.
Regulatory Context: Maps to Minimal/No-Risk AI Systems under the EU AI Act with no mandatory compliance obligations.
EU AI Act Alignment:
  • Internal productivity tools
  • Non-critical recommendation systems
  • Entertainment and gaming applications
  • General-purpose utilities (spam filters, search)
ISO 42001 Requirements:
  • Voluntary codes of conduct
  • Basic ethical AI principles
  • Standard software development practices
  • No specific AI governance mandates
NIST AI RMF Considerations:
  • Minimal potential for harm to people
  • Low organizational impact
  • Limited ecosystem effects
  • Easily reversible outcomes
Validation Behavior:
  • Flags explicit threats and hate speech
  • Detects clear, unambiguous violations
  • Identifies severe discriminatory language
  • Allows robust debate and strong opinions
Compliance Note: Sensitivity level selection affects your ability to demonstrate compliance with regulatory requirements. High-risk LLM systems under the EU AI Act require stricter content controls and comprehensive audit trails. ABV automatically captures all validation results for compliance documentation.

Making Decisions with Results

Your decision logic determines how your application responds to validation results:

Simple Decision Strategy

// Simple approach using only status
function handleResult(result, content) {
  if (result.status === "pass") {
    return { action: "allow", content };
  }

  if (result.status === "fail") {
    return { action: "block", message: "Content violates guidelines" };
  }

  // You choose how to handle "unsure"
  // Conservative: treat as fail
  // Permissive: treat as pass
  // Balanced: flag for review
  return { action: "review", content };
}

Sophisticated Decision Strategy

Use confidence scores for tiered responses:
// Tiered approach using status AND confidence
async function handleResult(result, content) {
  if (result.status === "pass") {
    return { action: "allow", content };
  }

  if (result.status === "fail" && result.confidence > 0.8) {
    // High-confidence failure: auto-block
    await logRejection(content, result.reason, "auto");
    return { action: "block", message: "Content violates guidelines" };
  }

  if (result.status === "fail" && result.confidence > 0.6) {
    // Medium-confidence failure: flag for review
    await queueForReview(content, result);
    return { action: "pending", message: "Content under review" };
  }

  // Low confidence or unsure: always review
  await queueForReview(content, result);
  return { action: "pending", message: "Content under review" };
}

Response Times and Costs

Understanding performance characteristics helps you build efficient validation pipelines. Rule-based checks are effectively free and instant, while LLM-powered checks take roughly 1-3 seconds and add model inference cost, which is why the patterns above run them only after fast checks pass.
Rule-based guardrails:
Performance:
  • Response time: < 10 milliseconds
  • Cost: $0 (runs locally)
  • Predictable: Always same speed
Best for:
  • Pre-filtering before expensive checks
  • High-volume validation
  • Real-time validation
  • Patterns you can enumerate
Examples:
  • Contains String: Check for forbidden terms
  • Valid JSON: Validate structured outputs

Observations and Monitoring

Every guardrail execution automatically creates an observation in your ABV dashboard.
What’s captured:
  • Input text (the content you validated)
  • Result (status, confidence, reason)
  • Configuration (sensitivity, mode, schema)
  • Performance (timing, token usage)
  • Context (user, session, trace)
How to use observations:
  • Track failure rates by guardrail type
  • See which content types cause the most failures
  • Identify trends in user behavior
  • Spot unusual spikes in violations
  • See where ambiguity occurs in your validation
  • Identify categories that need better rules
  • Understand when human review is needed most
  • Tune confidence thresholds for decisions
  • Too many false positives? Lower sensitivity
  • Harmful content slipping through? Raise sensitivity
  • Different sensitivities for different contexts
  • A/B test different sensitivity levels
  • Examine full context of a validation
  • Understand why specific content failed/passed
  • Reproduce issues for investigation
  • Improve your prompts based on patterns

Combining Multiple Guardrails

Most applications use multiple guardrails together:

Independent Guardrails (Run in Parallel)

Guardrails checking different criteria can run simultaneously:
// These checks are independent - run in parallel
const [toxicCheck, biasCheck] = await Promise.all([
  abv.guardrails.toxicLanguage.validate(content),
  abv.guardrails.biasedLanguage.validate(content)
]);

// Total time = slowest check (not sum of both)

Dependent Guardrails (Run Sequentially)

Create validation pipelines where fast checks filter before expensive checks:
// Sequential pipeline: fast check filters before expensive check
const quickCheck = await abv.guardrails.containsString.validate(content, {
  strings: ["forbidden", "banned", "prohibited"],
  mode: "none"
});

if (quickCheck.status === "fail") {
  return { valid: false };  // Failed quick check, skip expensive check
}

// Only run expensive LLM check if quick check passed
const deepCheck = await abv.guardrails.toxicLanguage.validate(content);
return { valid: deepCheck.status === "pass" };

Next Steps