How Observability & Tracing Works

Observability is the ability to understand what’s happening inside your LLM application by examining its external outputs. Traces are the fundamental units of observability—individual records capturing complete LLM interactions from start to finish.

Install the SDK

Add ABV to your application using the Python or JavaScript/TypeScript SDK. The SDK automatically instruments common LLM libraries (OpenAI, Anthropic, LangChain, LlamaIndex, etc.) with zero manual tracing code required for most use cases.
pip install abvdev  # Python
npm install @abvdev/tracing @abvdev/otel  # JavaScript/TypeScript

Initialize with Your API Key

Configure ABV with your project’s API key (found in the ABV Dashboard). The SDK connects to ABV’s backend and starts capturing telemetry automatically.
from abvdev import ABV

abv = ABV(api_key="your-api-key-here")

Run Your LLM Application

Make LLM calls as you normally would. ABV automatically captures:
  • Inputs: User queries, prompts, system instructions, few-shot examples
  • Outputs: LLM responses, tool calls, structured outputs
  • Metadata: Tokens (input/output), latency, costs, model parameters
  • Context: Sessions, users, environments, tags, custom metadata
from openai import OpenAI

client = OpenAI()

# ABV automatically traces this call
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

View Traces in the Dashboard

Navigate to the ABV Dashboard to explore traces. Each trace shows the complete interaction timeline, including:
  • Full prompt and response content
  • Token usage and costs broken down by model
  • Latency metrics (first token, total duration)
  • Tool/function calls with inputs and outputs
  • Error messages and stack traces for failures
Filter by user, session, environment, tags, or metadata to find specific interactions. Share trace URLs with teammates for collaborative debugging.
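For example, filtering by user or session works best when traces carry that context up front. A minimal sketch, assuming the abv.observe context manager shown later in this guide and illustrative metadata keys (user_id, session_id, feature):
from abvdev import ABV
from openai import OpenAI

abv = ABV(api_key="your-api-key-here")
client = OpenAI()

# Attach identifying context so this trace can later be found in the
# dashboard by filtering on metadata.user_id or metadata.session_id.
with abv.observe(metadata={
    "user_id": "user-1234",        # illustrative key
    "session_id": "session-5678",  # illustrative key
    "feature": "support-chat"      # illustrative key
}):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "What is your refund policy?"}]
    )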

Build on Your Data

Once traces are flowing, leverage ABV’s advanced features:
  • Evaluations: Run systematic benchmarks on production data
  • Prompt Management: Version prompts and A/B test changes
  • Cost Tracking: Analyze spend by feature, user, or model
  • Guardrails: Catch problematic outputs before they reach users
Your traces become the foundation for continuous improvement.

Why You Should Set Up Tracing

Traditional debugging for LLM applications is painful. Print statements don’t work for complex agent workflows. Logs are scattered across services. Reproducing production issues locally is nearly impossible due to non-deterministic LLM behavior. With ABV tracing:
  • See the complete interaction timeline for any request, including retries, tool calls, and nested LLM calls
  • Replay production traces with identical inputs to reproduce bugs deterministically
  • Compare successful vs. failed traces side-by-side to identify root causes
  • Share trace URLs with teammates for instant context sharing
Example workflow:
  1. Customer reports an issue: ā€œThe chatbot gave me wrong refund informationā€
  2. Search traces by user ID to find their session
  3. Click the problematic trace to see the full conversation history
  4. Notice the RAG retrieval step pulled outdated policy documents
  5. Fix the retrieval logic and verify with new traces
What used to take hours of back-and-forth with customers now takes minutes.
LLM costs can spiral quickly, especially with production traffic. Without granular tracking, you don’t know:
  • Which features or workflows are most expensive
  • Whether users are exploiting your system with excessive requests
  • If a recent code change increased costs
  • How much each customer/tenant costs to serve
ABV automatically tracks:
  • Input/output tokens for every model call
  • Calculated costs using current provider pricing
  • Cost breakdowns by user, session, environment, model, or custom metadata
  • Cost trends over time with daily/weekly/monthly aggregations
Real-world savings:
  • Discovered a feature using GPT-4 when GPT-3.5 would suffice → 10x cost reduction
  • Identified users triggering infinite retry loops → prevented runaway costs
  • Found prompt inefficiencies (e.g., redundant context) → reduced tokens 30%
Set up usage alerts to get notified when costs exceed thresholds.
Latency matters for LLM applications. Users notice when responses are slow, and slow responses hurt conversion rates, engagement, and satisfaction. ABV helps you optimize:
  • Identify slow model calls, RAG retrievals, or tool executions
  • Measure time-to-first-token (streaming latency) vs. total duration
  • Compare latency across models (e.g., GPT-4 vs. Claude vs. Llama)
  • Track latency percentiles (p50, p95, p99) to catch tail latency issues
  • Correlate latency with cost to find the best price/performance tradeoff
Optimization patterns:
  • Use faster models for simple tasks (gpt-3.5-turbo vs. gpt-4)
  • Implement caching for repeated queries (ABV integrates with prompt caching)
  • Parallelize independent LLM/tool calls (see the async sketch below)
  • Stream responses to improve perceived latency
  • Set aggressive timeouts to prevent hanging requests
Filter traces by latency ranges (e.g., >5 seconds) to find slow outliers.
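For example, the parallelization pattern above can be implemented with the OpenAI SDK’s async client; a minimal sketch (the prompts and model choice are illustrative):
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def ask(prompt: str) -> str:
    # Each call is independent of the others, so it can run concurrently.
    response = await client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def main() -> None:
    # asyncio.gather runs both requests at the same time, so total latency
    # is roughly the slower of the two calls rather than their sum.
    summary, keywords = await asyncio.gather(
        ask("Summarize this support ticket in one sentence."),
        ask("List the key topics mentioned in this support ticket."),
    )
    print(summary)
    print(keywords)

asyncio.run(main())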
You can’t improve what you don’t measure. Manual testing doesn’t scale, and gut feelings aren’t data. ABV’s evaluation tools require a foundation of production traces. Evaluations unlock:
  • Prompt versioning: Compare v1 vs. v2 on real user queries
  • Model comparisons: Test GPT-4 vs. Claude on your specific use case
  • Regression detection: Catch quality drops after code/prompt changes
  • A/B testing: Measure user satisfaction, conversion, or task success rates
  • LLM-as-a-Judge: Automatically score outputs for relevance, coherence, safety
Example evaluation workflow:
  1. Export 100 production traces from the past week
  2. Create a dataset in ABV Evaluations
  3. Run both the current prompt and a new candidate prompt
  4. Compare outputs using LLM-as-a-Judge scoring
  5. Promote the winning prompt to production with confidence
Without traces, you’re guessing. With traces, you’re measuring.
When customers report issues, support teams waste hours trying to reproduce problems or gather context. ā€œIt didn’t workā€ isn’t actionable without details. ABV transforms support:
  • Search traces by user ID, session ID, or custom metadata (e.g., order number)
  • See exactly what the user said, what the LLM responded, and what went wrong
  • Share trace URLs with engineering for instant context handoff
  • Filter by error status to proactively find and fix issues before users complain
  • Add comments on traces for internal notes and collaboration
Support playbook:
  1. Customer: ā€œI asked for a refund but got an errorā€
  2. Support agent searches traces by user email
  3. Finds the failed trace, sees the LLM tried to call a deprecated API
  4. Shares trace URL with engineering: ā€œThis API call is failingā€
  5. Engineering fixes the integration, support confirms with new traces
Time to resolution drops from days to hours. Customer satisfaction improves. Engineering gets actionable bug reports instead of vague complaints.

Core Tracing Features

Environments separate traces by deployment stage (development, staging, production). This prevents:
  • Test data polluting production metrics
  • Accidentally analyzing dev traces when investigating production issues
  • Mixing performance benchmarks across environments (dev is slower than prod)
Setting an environment:
import os

abv.init(
    api_key="...",
    environment=os.getenv("ENV", "development")  # e.g., "production"
)
Common environment names:
  • development: Local development machines
  • staging: Pre-production testing environment
  • production: Live user-facing application
  • ci: Continuous integration test runs
In the dashboard, filter by environment to compare performance or isolate issues. You can also set different sampling rates per environment (e.g., trace 100% in dev, 10% in production). See Environments for deployment strategies.
Metadata is arbitrary key-value data attached to traces. Use it to add context that makes filtering and analysis precise:
  • Model parameters: {"model": "gpt-4", "temperature": 0.7}
  • Business context: {"tenant_id": "acme-corp", "feature": "summarization"}
  • Deployment info: {"version": "1.2.3", "region": "us-west-2"}
  • User context: {"subscription_tier": "enterprise", "ab_test_group": "variant_b"}
Adding metadata:
with abv.observe(metadata={
    "tenant_id": "acme-corp",
    "feature": "document-qa",
    "model": "gpt-4",
    "version": "1.2.3"
}):
    response = client.chat.completions.create(...)
Querying by metadata: In the dashboard, filter traces with advanced queries:
  • metadata.tenant_id = "acme-corp" → All traces for a specific tenant
  • metadata.feature = "summarization" → All traces for a specific feature
  • metadata.version = "1.2.3" AND environment = "production" → Traces for a specific deployment
Metadata is indexed for fast querying. You can export filtered traces for evaluations or analysis. See Metadata for best practices and advanced patterns.
Tags are simple string labels that categorize traces for quick filtering. Unlike metadata (structured key-value pairs), tags are flat labels:
  • ["error", "timeout"] → Mark traces with failures
  • ["high-priority", "customer-facing"] → Prioritize critical workflows
  • ["experiment-v2", "canary"] → Flag experimental features
Adding tags:
with abv.observe(tags=["experiment-v2", "high-priority"]):
    response = client.chat.completions.create(...)
When to use tags vs. metadata:
  • Tags: Simple categorization, boolean flags (ā€œis this an error?ā€), filtering by presence
  • Metadata: Structured data with specific values (ā€œwhich tenant?ā€, ā€œwhat model version?ā€), filtering by exact match or range
In the dashboard, click a tag to filter all traces with that tag instantly. Tags appear as chips for visual scanning. See Tags for tagging strategies.
In distributed systems, a single user request might trigger:
  1. API gateway (validates request)
  2. Backend service (calls LLM)
  3. RAG service (retrieves documents)
  4. Database (logs results)
Without correlation, you can’t connect events across services. Trace IDs solve this by propagating a unique identifier through the entire request lifecycle. How it works:
  1. Generate a trace ID when the request enters your system
  2. Pass the trace ID to every downstream service (via HTTP headers, message queues, etc.)
  3. Each service logs events with the same trace ID
  4. ABV groups all events by trace ID for end-to-end visibility
Setting a custom trace ID:
import uuid

# Generate a trace ID (or extract from incoming request headers)
trace_id = str(uuid.uuid4())

# Pass to ABV
with abv.observe(trace_id=trace_id):
    response = client.chat.completions.create(...)
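To propagate the identifier across services, each downstream service can reuse the trace ID it receives instead of generating a new one. A minimal sketch, assuming a hypothetical X-Trace-Id header name and a plain dict of incoming request headers:
import uuid

def handle_request(headers: dict, user_message: str):
    # Reuse the caller's trace ID if one was provided, otherwise start a new trace.
    trace_id = headers.get("X-Trace-Id") or str(uuid.uuid4())

    with abv.observe(trace_id=trace_id):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": user_message}]
        )

    # Forward the same ID to downstream calls so all events share one trace.
    downstream_headers = {"X-Trace-Id": trace_id}
    return response, downstream_headers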
Distributed tracing with OpenTelemetry: ABV supports OpenTelemetry-compatible trace IDs. If you’re already using OpenTelemetry, ABV automatically extracts the trace context and correlates spans. In the dashboard, search by trace ID to see the full request timeline across all services. See Trace IDs & Distributed Tracing for advanced patterns.
Modern LLM applications process more than just text. Vision models analyze images. Voice assistants transcribe audio. Document pipelines parse PDFs. ABV captures all modalities:
  • Text: Standard messages, prompts, completions
  • Images: Base64-encoded or URL-referenced images for GPT-4 Vision, Claude 4, Gemini Pro Vision
  • Audio: Transcriptions (Whisper), voice inputs, TTS outputs
  • Files: PDFs, CSVs, JSON, spreadsheets, code files
Example: Tracing a vision model call
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]
    }]
)
# ABV automatically captures the image URL and displays it in the dashboard
In the dashboard:
  • Images render inline for visual inspection
  • Audio files play directly in the trace viewer
  • Files are downloadable for local analysis
  • JSON/structured outputs are syntax-highlighted
This eliminates the need to reconstruct multi-modal inputs manually or store attachments separately. See Multi-Modality and Attachments for supported formats and size limits.

Advanced Tracing Features

Every LLM call costs money. Without tracking, you don’t know which features, users, or workflows are expensive. ABV automatically calculates costs for every trace based on:
  • Input tokens Ɨ provider’s input price
  • Output tokens Ɨ provider’s output price
  • Provider-specific pricing (cached tokens, batch discounts, etc.)
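As a back-of-the-envelope illustration (the per-token prices below are made up for this example, not actual provider pricing):
# Hypothetical pricing: $0.03 per 1K input tokens, $0.06 per 1K output tokens
input_tokens = 1_200
output_tokens = 350

cost = (input_tokens / 1000) * 0.03 + (output_tokens / 1000) * 0.06
print(f"${cost:.4f}")  # $0.0570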
What ABV tracks:
  • Input/output token counts
  • Total cost per trace (in USD)
  • Cost breakdowns by model, user, session, environment, metadata
  • Cost trends over time (daily, weekly, monthly aggregations)
Cost analysis queries:
  • ā€œWhich users cost the most to serve?ā€
  • ā€œHow much did feature X cost last month?ā€
  • ā€œIs GPT-4 or Claude cheaper for summarization?ā€
  • ā€œDid costs increase after deploying version 2.0?ā€
Set up usage alerts to get notified when costs exceed thresholds (e.g., >$100/day). See Model Usage & Cost Tracking for detailed cost optimization strategies.
LLM traces often contain sensitive data:
  • PII: Emails, phone numbers, social security numbers, addresses
  • Secrets: API keys, passwords, tokens
  • Proprietary prompts: System instructions you don’t want logged
ABV supports multiple masking strategies:
1. Regex-based masking (replace patterns with [REDACTED]):
abv.init(
    api_key="...",
    mask_patterns=[
        r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b",  # Emails
        r"\b\d{3}-\d{2}-\d{4}\b",  # SSNs
        r"sk-[a-zA-Z0-9]{32,}"  # OpenAI API keys
    ]
)
2. Field-level masking (exclude entire fields from logging):
abv.init(
    api_key="...",
    exclude_fields=["messages.0.content", "metadata.api_key"]
)
3. Custom masking functions (dynamic logic):
def custom_masker(data):
    if "email" in data.get("metadata", {}):
        data["metadata"]["email"] = "[REDACTED]"
    return data

abv.init(api_key="...", masker=custom_masker)
Compliance:
  • GDPR: Mask PII or exclude fields containing personal data
  • HIPAA: Redact PHI (protected health information)
  • SOC 2: Demonstrate data handling controls with audit logs
See Masking Sensitive Data for implementation guides and compliance patterns.
Not all traces are equally important. Log levels let you prioritize signal over noise:
  • DEBUG: Verbose internal details (tool calls, intermediate steps)
  • INFO: Standard traces (successful LLM calls)
  • WARNING: Degraded performance (slow responses, fallbacks)
  • ERROR: Failures (API errors, timeouts, invalid outputs)
Setting log levels:
# Set minimum log level (only log WARNING and ERROR)
abv.init(api_key="...", log_level="WARNING")

# Or set per-trace
with abv.observe(log_level="ERROR"):
    response = client.chat.completions.create(...)
Use cases:
  • Production: Set to WARNING or ERROR to reduce noise and focus on issues
  • Development: Set to DEBUG for maximum visibility
  • Cost optimization: Only log ERROR traces to minimize ingestion costs
In the dashboard, filter by log level to focus on failures or warnings. See Log Levels for severity guidelines.
You deployed a new prompt, updated the model, or changed the RAG retrieval logic. Did quality improve or degrade? Releases annotate traces with version information so you can:
  • Compare performance before/after a deployment
  • Identify regressions introduced by code changes
  • Correlate quality drops with specific versions
  • A/B test different versions in production
Setting a release version:
abv.init(api_key="...", release="v1.2.3")
Versioning strategies:
  • Semantic versioning: v1.2.3 (major.minor.patch)
  • Git commit SHAs: abc1234 (unique per deployment)
  • Timestamp-based: 2025-01-15-v2 (human-readable)
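For example, a deployment pipeline can inject the commit SHA through an environment variable (the GIT_COMMIT_SHA name is illustrative) and fall back to a local placeholder:
import os

# Use the commit SHA injected at deploy time, or a placeholder when running locally.
abv.init(api_key="...", release=os.getenv("GIT_COMMIT_SHA", "local-dev"))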
In the dashboard, filter traces by release version and compare metrics (latency, cost, error rate) across versions. See Releases & Versioning for deployment workflows.
Debugging complex issues often requires collaboration. Instead of copying trace URLs into Slack and losing context, comment directly on traces in the ABV dashboard. Use cases:
  • Code reviews: ā€œThis prompt could be more concise—too many tokensā€
  • Bug reports: ā€œ@engineer This RAG retrieval pulled the wrong documentā€
  • Customer support: ā€œUser reported this response as unhelpful—needs investigationā€
  • Evaluations: ā€œMark this output as incorrect for the next benchmark runā€
Features:
  • Markdown support for formatting
  • @mentions to notify teammates
  • Thread replies for discussions
  • Comments visible on trace, session, and evaluation pages
In the dashboard, click the comment icon on any trace to add notes. Teammates get notified and can reply inline. See Comments on Objects for collaboration workflows.
Tracing 100% of production traffic can be expensive, especially at scale (millions of traces/day). Sampling lets you capture a representative subset while reducing ingestion costs. Sampling strategies:
1. Rate-based sampling (trace X% of traffic):
abv.init(api_key="...", sample_rate=0.1)  # Trace 10%
2. Rule-based sampling (trace errors, slow requests, specific users):
import random

def should_sample(trace):
    # Always trace errors
    if trace.get("error"):
        return True
    # Always trace slow requests
    if trace.get("latency_ms", 0) > 5000:
        return True
    # Sample 10% of everything else
    return random.random() < 0.1

abv.init(api_key="...", sampler=should_sample)
3. Environment-specific sampling:
sample_rate = 1.0 if os.getenv("ENV") == "development" else 0.05
abv.init(api_key="...", sample_rate=sample_rate)
# Trace 100% in dev, 5% in production
Best practices:
  • Start with 100% sampling until you understand your traffic patterns
  • Always trace errors and slow requests (use rule-based sampling)
  • Sample more aggressively in production (5-10%) than dev (100%)
  • Exclude health checks and monitoring probes from tracing
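Combining these, the rule-based sampler from above can be extended to skip health checks as well; a sketch assuming the trace dict exposes an illustrative route field:
import random

def should_sample(trace):
    # Skip health checks and monitoring probes entirely (route key is illustrative).
    if trace.get("route") in ("/healthz", "/readyz"):
        return False
    # Always keep errors and slow requests.
    if trace.get("error") or trace.get("latency_ms", 0) > 5000:
        return True
    # Sample 10% of everything else.
    return random.random() < 0.1

abv.init(api_key="...", sampler=should_sample)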
Sampling saves costs without sacrificing observability. You still get representative data for debugging, cost analysis, and evaluations. See Sampling for advanced sampling strategies.

Related Topics