How Observability & Tracing Works

Observability is the ability to understand what’s happening inside your LLM application by examining its external outputs. Traces are the fundamental units of observability—individual records capturing complete LLM interactions from start to finish.

Install the SDK

Add ABV to your application using the Python or JavaScript/TypeScript SDK. The SDK automatically instruments common LLM libraries (OpenAI, Anthropic, LangChain, LlamaIndex, etc.) with zero manual tracing code required for most use cases.
pip install abvdev  # Python
npm install @abvdev/tracing @abvdev/otel  # JavaScript/TypeScript

Initialize with Your API Key

Configure ABV with your project’s API key (found in the ABV Dashboard). The SDK connects to ABV’s backend and starts capturing telemetry automatically.
from abvdev import ABV

abv = ABV(api_key="your-api-key-here")

Run Your LLM Application

Make LLM calls as you normally would. ABV automatically captures:
  • Inputs: User queries, prompts, system instructions, few-shot examples
  • Outputs: LLM responses, tool calls, structured outputs
  • Metadata: Tokens (input/output), latency, costs, model parameters
  • Context: Sessions, users, environments, tags, custom metadata
from openai import OpenAI

client = OpenAI()

# ABV automatically traces this call
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

View Traces in the Dashboard

Navigate to the ABV Dashboard to explore traces. Each trace shows the complete interaction timeline, including:
  • Full prompt and response content
  • Token usage and costs broken down by model
  • Latency metrics (first token, total duration)
  • Tool/function calls with inputs and outputs
  • Error messages and stack traces for failures
Filter by user, session, environment, tags, or metadata to find specific interactions. Share trace URLs with teammates for collaborative debugging.
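For example, filtering by user or session works best when traces carry that context up front. A minimal sketch, assuming the abv.observe context manager shown later in this guide and illustrative metadata keys (user_id, session_id, feature):
from abvdev import ABV
from openai import OpenAI

abv = ABV(api_key="your-api-key-here")
client = OpenAI()

# Attach identifying context so this trace can later be found in the
# dashboard by filtering on metadata.user_id or metadata.session_id.
with abv.observe(metadata={
    "user_id": "user-1234",        # illustrative key
    "session_id": "session-5678",  # illustrative key
    "feature": "support-chat"      # illustrative key
}):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "What is your refund policy?"}]
    )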

Build on Your Data

Once traces are flowing, leverage ABV’s advanced features:
  • Evaluations: Run systematic benchmarks on production data
  • Prompt Management: Version prompts and A/B test changes
  • Cost Tracking: Analyze spend by feature, user, or model
  • Guardrails: Catch problematic outputs before they reach users
Your traces become the foundation for continuous improvement.

Why You Should Set Up Tracing

Traditional debugging for LLM applications is painful. Print statements don’t work for complex agent workflows. Logs are scattered across services. Reproducing production issues locally is nearly impossible due to non-deterministic LLM behavior. With ABV tracing:
  • See the complete interaction timeline for any request, including retries, tool calls, and nested LLM calls
  • Replay production traces with identical inputs to reproduce bugs deterministically
  • Compare successful vs. failed traces side-by-side to identify root causes
  • Share trace URLs with teammates for instant context sharing
Example workflow:
  1. Customer reports an issue: ā€œThe chatbot gave me wrong refund informationā€
  2. Search traces by user ID to find their session
  3. Click the problematic trace to see the full conversation history
  4. Notice the RAG retrieval step pulled outdated policy documents
  5. Fix the retrieval logic and verify with new traces
What used to take hours of back-and-forth with customers now takes minutes.
LLM costs can spiral quickly, especially with production traffic. Without granular tracking, you don’t know:
  • Which features or workflows are most expensive
  • Whether users are exploiting your system with excessive requests
  • If a recent code change increased costs
  • How much each customer/tenant costs to serve
ABV automatically tracks:
  • Input/output tokens for every model call
  • Calculated costs using current provider pricing
  • Cost breakdowns by user, session, environment, model, or custom metadata
  • Cost trends over time with daily/weekly/monthly aggregations
Real-world savings:
  • Discovered a feature using GPT-4 when GPT-3.5 would suffice → 10x cost reduction
  • Identified users triggering infinite retry loops → prevented runaway costs
  • Found prompt inefficiencies (e.g., redundant context) → reduced tokens 30%
Set up usage alerts to get notified when costs exceed thresholds.
Latency matters for LLM applications. Users notice when responses are slow, and slow responses hurt conversion rates, engagement, and satisfaction. ABV helps you optimize:
  • Identify slow model calls, RAG retrievals, or tool executions
  • Measure time-to-first-token (streaming latency) vs. total duration
  • Compare latency across models (e.g., GPT-4 vs. Claude vs. Llama)
  • Track latency percentiles (p50, p95, p99) to catch tail latency issues
  • Correlate latency with cost to find the best price/performance tradeoff
Optimization patterns:
  • Use faster models for simple tasks (gpt-3.5-turbo vs. gpt-4)
  • Implement caching for repeated queries (ABV integrates with prompt caching)
  • Parallelize independent LLM/tool calls (see the async sketch below)
  • Stream responses to improve perceived latency
  • Set aggressive timeouts to prevent hanging requests
Filter traces by latency ranges (e.g., >5 seconds) to find slow outliers.
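For example, the parallelization pattern above can be implemented with the OpenAI SDK’s async client; a minimal sketch (the prompts and model choice are illustrative):
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def ask(prompt: str) -> str:
    # Each call is independent of the others, so it can run concurrently.
    response = await client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def main() -> None:
    # asyncio.gather runs both requests at the same time, so total latency
    # is roughly the slower of the two calls rather than their sum.
    summary, keywords = await asyncio.gather(
        ask("Summarize this support ticket in one sentence."),
        ask("List the key topics mentioned in this support ticket."),
    )
    print(summary)
    print(keywords)

asyncio.run(main())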
You can’t improve what you don’t measure. Manual testing doesn’t scale, and gut feelings aren’t data. ABV’s evaluation tools require a foundation of production traces. Evaluations unlock:
  • Prompt versioning: Compare v1 vs. v2 on real user queries
  • Model comparisons: Test GPT-4 vs. Claude on your specific use case
  • Regression detection: Catch quality drops after code/prompt changes
  • A/B testing: Measure user satisfaction, conversion, or task success rates
  • LLM-as-a-Judge: Automatically score outputs for relevance, coherence, safety
Example evaluation workflow:
  1. Export 100 production traces from the past week
  2. Create a dataset in ABV Evaluations
  3. Run both the current prompt and a new candidate prompt
  4. Compare outputs using LLM-as-a-Judge scoring
  5. Promote the winning prompt to production with confidence
Without traces, you’re guessing. With traces, you’re measuring.
When customers report issues, support teams waste hours trying to reproduce problems or gather context. ā€œIt didn’t workā€ isn’t actionable without details. ABV transforms support:
  • Search traces by user ID, session ID, or custom metadata (e.g., order number)
  • See exactly what the user said, what the LLM responded, and what went wrong
  • Share trace URLs with engineering for instant context handoff
  • Filter by error status to proactively find and fix issues before users complain
  • Add comments on traces for internal notes and collaboration
Support playbook:
  1. Customer: ā€œI asked for a refund but got an errorā€
  2. Support agent searches traces by user email
  3. Finds the failed trace, sees the LLM tried to call a deprecated API
  4. Shares trace URL with engineering: ā€œThis API call is failingā€
  5. Engineering fixes the integration, support confirms with new traces
Time to resolution drops from days to hours. Customer satisfaction improves. Engineering gets actionable bug reports instead of vague complaints.

Core Tracing Features

Environments separate traces by deployment stage (development, staging, production). This prevents:
  • Test data polluting production metrics
  • Accidentally analyzing dev traces when investigating production issues
  • Mixing performance benchmarks across environments (dev is slower than prod)
Setting an environment:
import os

abv.init(
    api_key="...",
    environment=os.getenv("ENV", "development")  # e.g., "production"
)
Common environment names:
  • development: Local development machines
  • staging: Pre-production testing environment
  • production: Live user-facing application
  • ci: Continuous integration test runs
In the dashboard, filter by environment to compare performance or isolate issues. You can also set different sampling rates per environment (e.g., trace 100% in dev, 10% in production). See Environments for deployment strategies.
Metadata is arbitrary key-value data attached to traces. Use it to add context that makes filtering and analysis precise:
  • Model parameters: {"model": "gpt-4", "temperature": 0.7}
  • Business context: {"tenant_id": "acme-corp", "feature": "summarization"}
  • Deployment info: {"version": "1.2.3", "region": "us-west-2"}
  • User context: {"subscription_tier": "enterprise", "ab_test_group": "variant_b"}
Adding metadata:
with abv.observe(metadata={
    "tenant_id": "acme-corp",
    "feature": "document-qa",
    "model": "gpt-4",
    "version": "1.2.3"
}):
    response = client.chat.completions.create(...)
Querying by metadata: In the dashboard, filter traces with advanced queries:
  • metadata.tenant_id = "acme-corp" → All traces for a specific tenant
  • metadata.feature = "summarization" → All traces for a specific feature
  • metadata.version = "1.2.3" AND environment = "production" → Traces for a specific deployment
Metadata is indexed for fast querying. You can export filtered traces for evaluations or analysis. See Metadata for best practices and advanced patterns.
Tags are simple string labels that categorize traces for quick filtering. Unlike metadata (structured key-value pairs), tags are flat labels:
  • ["error", "timeout"] → Mark traces with failures
  • ["high-priority", "customer-facing"] → Prioritize critical workflows
  • ["experiment-v2", "canary"] → Flag experimental features
Adding tags:
with abv.observe(tags=["experiment-v2", "high-priority"]):
    response = client.chat.completions.create(...)
When to use tags vs. metadata:
  • Tags: Simple categorization, boolean flags (ā€œis this an error?ā€), filtering by presence
  • Metadata: Structured data with specific values (ā€œwhich tenant?ā€, ā€œwhat model version?ā€), filtering by exact match or range
In the dashboard, click a tag to filter all traces with that tag instantly. Tags appear as chips for visual scanning. See Tags for tagging strategies.
In distributed systems, a single user request might trigger:
  1. API gateway (validates request)
  2. Backend service (calls LLM)
  3. RAG service (retrieves documents)
  4. Database (logs results)
Without correlation, you can’t connect events across services. Trace IDs solve this by propagating a unique identifier through the entire request lifecycle. How it works:
  1. Generate a trace ID when the request enters your system
  2. Pass the trace ID to every downstream service (via HTTP headers, message queues, etc.)
  3. Each service logs events with the same trace ID
  4. ABV groups all events by trace ID for end-to-end visibility
Setting a custom trace ID:
import uuid

# Generate a trace ID (or extract from incoming request headers)
trace_id = str(uuid.uuid4())

# Pass to ABV
with abv.observe(trace_id=trace_id):
    response = client.chat.completions.create(...)
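To propagate the identifier across services, each downstream service can reuse the trace ID it receives instead of generating a new one. A minimal sketch, assuming a hypothetical X-Trace-Id header name and a plain dict of incoming request headers:
import uuid

def handle_request(headers: dict, user_message: str):
    # Reuse the caller's trace ID if one was provided, otherwise start a new trace.
    trace_id = headers.get("X-Trace-Id") or str(uuid.uuid4())

    with abv.observe(trace_id=trace_id):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": user_message}]
        )

    # Forward the same ID to downstream calls so all events share one trace.
    downstream_headers = {"X-Trace-Id": trace_id}
    return response, downstream_headers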
Distributed tracing with OpenTelemetry: ABV supports OpenTelemetry-compatible trace IDs. If you’re already using OpenTelemetry, ABV automatically extracts the trace context and correlates spans. In the dashboard, search by trace ID to see the full request timeline across all services. See Trace IDs & Distributed Tracing for advanced patterns.
Modern LLM applications process more than just text. Vision models analyze images. Voice assistants transcribe audio. Document pipelines parse PDFs. ABV captures all modalities:
  • Text: Standard messages, prompts, completions
  • Images: Base64-encoded or URL-referenced images for GPT-4 Vision, Claude 4, Gemini Pro Vision
  • Audio: Transcriptions (Whisper), voice inputs, TTS outputs
  • Files: PDFs, CSVs, JSON, spreadsheets, code files
Example: Tracing a vision model call
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]
    }]
)
# ABV automatically captures the image URL and displays it in the dashboard
In the dashboard:
  • Images render inline for visual inspection
  • Audio files play directly in the trace viewer
  • Files are downloadable for local analysis
  • JSON/structured outputs are syntax-highlighted
This eliminates the need to reconstruct multi-modal inputs manually or store attachments separately. See Multi-Modality and Attachments for supported formats and size limits.

Advanced Tracing Features

Every LLM call costs money. Without tracking, you don’t know which features, users, or workflows are expensive. ABV automatically calculates costs for every trace based on:
  • Input tokens Ɨ provider’s input price
  • Output tokens Ɨ provider’s output price
  • Provider-specific pricing (cached tokens, batch discounts, etc.)
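As a back-of-the-envelope illustration (the per-token prices below are made up for this example, not actual provider pricing):
# Hypothetical pricing: $0.03 per 1K input tokens, $0.06 per 1K output tokens
input_tokens = 1_200
output_tokens = 350

cost = (input_tokens / 1000) * 0.03 + (output_tokens / 1000) * 0.06
print(f"${cost:.4f}")  # $0.0570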
What ABV tracks:
  • Input/output token counts
  • Total cost per trace (in USD)
  • Cost breakdowns by model, user, session, environment, metadata
  • Cost trends over time (daily, weekly, monthly aggregations)
Cost analysis queries:
  • ā€œWhich users cost the most to serve?ā€
  • ā€œHow much did feature X cost last month?ā€
  • ā€œIs GPT-4 or Claude cheaper for summarization?ā€
  • ā€œDid costs increase after deploying version 2.0?ā€
Set up usage alerts to get notified when costs exceed thresholds (e.g., >$100/day). See Model Usage & Cost Tracking for detailed cost optimization strategies.
LLM traces often contain sensitive data:
  • PII: Emails, phone numbers, social security numbers, addresses
  • Secrets: API keys, passwords, tokens
  • Proprietary prompts: System instructions you don’t want logged
ABV supports multiple masking strategies:
1. Regex-based masking (replace patterns with [REDACTED]):
abv.init(
    api_key="...",
    mask_patterns=[
        r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b",  # Emails
        r"\b\d{3}-\d{2}-\d{4}\b",  # SSNs
        r"sk-[a-zA-Z0-9]{32,}"  # OpenAI API keys
    ]
)
2. Field-level masking (exclude entire fields from logging):
abv.init(
    api_key="...",
    exclude_fields=["messages.0.content", "metadata.api_key"]
)
3. Custom masking functions (dynamic logic):
def custom_masker(data):
    if "email" in data.get("metadata", {}):
        data["metadata"]["email"] = "[REDACTED]"
    return data

abv.init(api_key="...", masker=custom_masker)
Compliance:
  • GDPR: Mask PII or exclude fields containing personal data
  • HIPAA: Redact PHI (protected health information)
  • SOC 2: Demonstrate data handling controls with audit logs
See Masking Sensitive Data for implementation guides and compliance patterns.
Not all traces are equally important. Log levels let you prioritize signal over noise:
  • DEBUG: Verbose internal details (tool calls, intermediate steps)
  • INFO: Standard traces (successful LLM calls)
  • WARNING: Degraded performance (slow responses, fallbacks)
  • ERROR: Failures (API errors, timeouts, invalid outputs)
Setting log levels:
# Set minimum log level (only log WARNING and ERROR)
abv.init(api_key="...", log_level="WARNING")

# Or set per-trace
with abv.observe(log_level="ERROR"):
    response = client.chat.completions.create(...)
Use cases:
  • Production: Set to WARNING or ERROR to reduce noise and focus on issues
  • Development: Set to DEBUG for maximum visibility
  • Cost optimization: Only log ERROR traces to minimize ingestion costs
In the dashboard, filter by log level to focus on failures or warnings. See Log Levels for severity guidelines.
You deployed a new prompt, updated the model, or changed the RAG retrieval logic. Did quality improve or degrade? Releases annotate traces with version information so you can:
  • Compare performance before/after a deployment
  • Identify regressions introduced by code changes
  • Correlate quality drops with specific versions
  • A/B test different versions in production
Setting a release version:
abv.init(api_key="...", release="v1.2.3")
Versioning strategies:
  • Semantic versioning: v1.2.3 (major.minor.patch)
  • Git commit SHAs: abc1234 (unique per deployment)
  • Timestamp-based: 2025-01-15-v2 (human-readable)
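For example, a deployment pipeline can inject the commit SHA through an environment variable (the GIT_COMMIT_SHA name is illustrative) and fall back to a local placeholder:
import os

# Use the commit SHA injected at deploy time, or a placeholder when running locally.
abv.init(api_key="...", release=os.getenv("GIT_COMMIT_SHA", "local-dev"))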
In the dashboard, filter traces by release version and compare metrics (latency, cost, error rate) across versions. See Releases & Versioning for deployment workflows.
Debugging complex issues often requires collaboration. Instead of copying trace URLs into Slack and losing context, comment directly on traces in the ABV dashboard. Use cases:
  • Code reviews: ā€œThis prompt could be more concise—too many tokensā€
  • Bug reports: ā€œ@engineer This RAG retrieval pulled the wrong documentā€
  • Customer support: ā€œUser reported this response as unhelpful—needs investigationā€
  • Evaluations: ā€œMark this output as incorrect for the next benchmark runā€
Features:
  • Markdown support for formatting
  • @mentions to notify teammates
  • Thread replies for discussions
  • Comments visible on trace, session, and evaluation pages
In the dashboard, click the comment icon on any trace to add notes. Teammates get notified and can reply inline. See Comments on Objects for collaboration workflows.
Tracing 100% of production traffic can be expensive, especially at scale (millions of traces/day). Sampling lets you capture a representative subset while reducing ingestion costs. Sampling strategies:
1. Rate-based sampling (trace X% of traffic):
abv.init(api_key="...", sample_rate=0.1)  # Trace 10%
2. Rule-based sampling (trace errors, slow requests, specific users):
import random

def should_sample(trace):
    # Always trace errors
    if trace.get("error"):
        return True
    # Always trace slow requests
    if trace.get("latency_ms", 0) > 5000:
        return True
    # Sample 10% of everything else
    return random.random() < 0.1

abv.init(api_key="...", sampler=should_sample)
3. Environment-specific sampling:
sample_rate = 1.0 if os.getenv("ENV") == "development" else 0.05
abv.init(api_key="...", sample_rate=sample_rate)
# Trace 100% in dev, 5% in production
Best practices:
  • Start with 100% sampling until you understand your traffic patterns
  • Always trace errors and slow requests (use rule-based sampling)
  • Sample more aggressively in production (5-10%) than dev (100%)
  • Exclude health checks and monitoring probes from tracing
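Combining these, the rule-based sampler from above can be extended to skip health checks as well; a sketch assuming the trace dict exposes an illustrative route field:
import random

def should_sample(trace):
    # Skip health checks and monitoring probes entirely (route key is illustrative).
    if trace.get("route") in ("/healthz", "/readyz"):
        return False
    # Always keep errors and slow requests.
    if trace.get("error") or trace.get("latency_ms", 0) > 5000:
        return True
    # Sample 10% of everything else.
    return random.random() < 0.1

abv.init(api_key="...", sampler=should_sample)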
Sampling saves costs without sacrificing observability. You still get representative data for debugging, cost analysis, and evaluations. See Sampling for advanced sampling strategies.

Related Topics