How Observability & Tracing Works
Observability is the ability to understand what's happening inside your LLM application by examining its external outputs. Traces are the fundamental units of observability: individual records that capture a complete LLM interaction from start to finish.
Install the SDK
Initialize with Your API Key
Run Your LLM Application
Once the SDK is initialized, every LLM call is captured automatically (a minimal setup sketch follows the list below), including:
- Inputs: User queries, prompts, system instructions, few-shot examples
- Outputs: LLM responses, tool calls, structured outputs
- Metadata: Tokens (input/output), latency, costs, model parameters
- Context: Sessions, users, environments, tags, custom metadata
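What setup looks like depends on the SDK you install. As a rough sketch, assuming a hypothetical `abv` Python package with an `init()` call and an `observe` decorator (these names are illustrative, not the documented API):

```python
import os

import abv  # hypothetical ABV SDK client; the real package name and API may differ
from openai import OpenAI

# Initialize once at startup with your API key (the env var name is an assumption).
abv.init(api_key=os.environ["ABV_API_KEY"], environment="production")

client = OpenAI()

@abv.observe(name="support-answer")  # assumed decorator that records inputs, outputs, tokens, latency
def answer(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful support assistant."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(answer("How do I reset my password?"))
```

From then on, every decorated call produces a trace containing the inputs, outputs, metadata, and context listed above.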
View Traces in the Dashboard
- Full prompt and response content
- Token usage and costs broken down by model
- Latency metrics (first token, total duration)
- Tool/function calls with inputs and outputs
- Error messages and stack traces for failures
Build on Your Data
- Evaluations: Run systematic benchmarks on production data
- Prompt Management: Version prompts and A/B test changes
- Cost Tracking: Analyze spend by feature, user, or model
- Guardrails: Catch problematic outputs before they reach users
Why You Should Set Up Tracing
Accelerate Development & Debugging
- See the complete interaction timeline for any request, including retries, tool calls, and nested LLM calls
- Replay production traces with identical inputs to reproduce bugs deterministically
- Compare successful vs. failed traces side-by-side to identify root causes
- Share trace URLs with teammates for instant context sharing
Example debugging workflow:
- Customer reports an issue: "The chatbot gave me wrong refund information"
- Search traces by user ID to find their session
- Click the problematic trace to see the full conversation history
- Notice the RAG retrieval step pulled outdated policy documents
- Fix the retrieval logic and verify with new traces
Track Costs in Real Time
Cost tracking helps you understand:
- Which features or workflows are most expensive
- Whether users are exploiting your system with excessive requests
- If a recent code change increased costs
- How much each customer/tenant costs to serve
ABV records:
- Input/output tokens for every model call
- Calculated costs using current provider pricing
- Cost breakdowns by user, session, environment, model, or custom metadata
- Cost trends over time with daily/weekly/monthly aggregations
Example findings:
- Discovered a feature using GPT-4 when GPT-3.5 would suffice → 10x cost reduction
- Identified users triggering infinite retry loops → prevented runaway costs
- Found prompt inefficiencies (e.g., redundant context) → reduced tokens by 30%
Optimize Performance
- Identify slow model calls, RAG retrievals, or tool executions
- Measure time-to-first-token (streaming latency) vs. total duration
- Compare latency across models (e.g., GPT-4 vs. Claude vs. Llama)
- Track latency percentiles (p50, p95, p99) to catch tail latency issues
- Correlate latency with cost to find the best price/performance tradeoff
Common optimizations informed by tracing:
- Use faster models for simple tasks (gpt-3.5-turbo vs. gpt-4)
- Implement caching for repeated queries (ABV integrates with prompt caching)
- Parallelize independent LLM/tool calls (see the sketch after this list)
- Stream responses to improve perceived latency
- Set aggressive timeouts to prevent hanging requests
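For the parallelization point above, independent calls can simply be issued concurrently. A minimal sketch with asyncio and the OpenAI async client (the prompts and model are placeholders):

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def ask(prompt: str) -> str:
    # Each call is independent of the others, so they can run concurrently.
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def main() -> None:
    # Fan out the independent calls instead of awaiting them one by one.
    summaries = await asyncio.gather(
        ask("Summarize the billing FAQ."),
        ask("Summarize the refund policy."),
        ask("Summarize the shipping policy."),
    )
    for s in summaries:
        print(s)

asyncio.run(main())
```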
Enable Systematic Evaluations
- Prompt versioning: Compare v1 vs. v2 on real user queries
- Model comparisons: Test GPT-4 vs. Claude on your specific use case
- Regression detection: Catch quality drops after code/prompt changes
- A/B testing: Measure user satisfaction, conversion, or task success rates
- LLM-as-a-Judge: Automatically score outputs for relevance, coherence, safety
Example workflow:
- Export 100 production traces from the past week
- Create a dataset in ABV Evaluations
- Run both the current prompt and a new candidate prompt
- Compare outputs using LLM-as-a-Judge scoring (a minimal judge sketch follows this list)
- Promote the winning prompt to production with confidence
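The dataset and comparison steps happen in the dashboard; the judging step itself can be approximated in code. A toy LLM-as-a-Judge sketch (the rubric, score scale, and model are illustrative, not ABV's built-in evaluator):

```python
import json
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer: str) -> dict:
    """Ask a model to grade an answer for relevance and correctness (toy rubric)."""
    rubric = (
        "Rate the answer to the question on a 1-5 scale for relevance and correctness. "
        'Respond with JSON: {"score": <int>, "reason": "<short reason>"}'
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": rubric},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

print(judge("What is our refund window?", "Refunds are available within 30 days."))
```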
Streamline Customer Support
- Search traces by user ID, session ID, or custom metadata (e.g., order number)
- See exactly what the user said, what the LLM responded, and what went wrong
- Share trace URLs with engineering for instant context handoff
- Filter by error status to proactively find and fix issues before users complain
- Add comments on traces for internal notes and collaboration
Example support workflow:
- Customer: "I asked for a refund but got an error"
- Support agent searches traces by user email
- Finds the failed trace, sees the LLM tried to call a deprecated API
- Shares trace URL with engineering: "This API call is failing"
- Engineering fixes the integration, support confirms with new traces
Core Tracing Features
Sessions: Group Related Traces by User Journey
Assign a session ID to group every trace from a single conversation or workflow, so multi-turn interactions can be reviewed end to end.
Users: Link Traces to User Accounts
- Searching all traces for a specific user (for support or debugging)
- Analyzing cost per user (identify expensive users or tiers)
- Filtering traces by user cohorts (free vs. paid, region, etc.)
- Personalizing evaluations (compare model performance by user segment)
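How user and session IDs are attached depends on the SDK. A sketch using the same hypothetical `abv` client, setting both on the current trace (function and parameter names are illustrative):

```python
import abv  # hypothetical ABV SDK; function and parameter names are illustrative

def run_llm(message: str) -> str:
    # Placeholder for your actual model call.
    return f"echo: {message}"

@abv.observe(name="chat-turn")
def handle_turn(user_id: str, session_id: str, message: str) -> str:
    # Attach the authenticated user and the conversation to the current trace so it can
    # later be searched per user (support, cost per user) and grouped per session.
    abv.update_current_trace(user_id=user_id, session_id=session_id)
    return run_llm(message)

handle_turn(user_id="user_123", session_id="sess_456", message="Where is my order?")
```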
Environments: Separate Dev, Staging, and Production
Without environment separation, you risk:
- Test data polluting production metrics
- Accidentally analyzing dev traces when investigating production issues
- Mixing performance benchmarks across environments (dev is slower than prod)
Typical environment values:
- development: Local development machines
- staging: Pre-production testing environment
- production: Live user-facing application
- ci: Continuous integration test runs
Metadata: Attach Structured Context to Traces
- Model parameters: {"model": "gpt-4", "temperature": 0.7}
- Business context: {"tenant_id": "acme-corp", "feature": "summarization"}
- Deployment info: {"version": "1.2.3", "region": "us-west-2"}
- User context: {"subscription_tier": "enterprise", "ab_test_group": "variant_b"}
You can then filter traces on any metadata field:
- metadata.tenant_id = "acme-corp" → All traces for a specific tenant
- metadata.feature = "summarization" → All traces for a specific feature
- metadata.version = "1.2.3" AND environment = "production" → Traces for a specific deployment
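With the same hypothetical client, metadata could be attached as a plain dictionary on the current trace (the update call and its parameters are assumptions):

```python
import abv  # hypothetical ABV SDK; the update call and its parameters are illustrative

@abv.observe(name="summarize-report")
def summarize(tenant_id: str, text: str) -> str:
    # Filterable later as metadata.tenant_id, metadata.feature, metadata.version, etc.
    abv.update_current_trace(metadata={
        "tenant_id": tenant_id,
        "feature": "summarization",
        "version": "1.2.3",
    })
    return text[:200]  # placeholder for the real summarization call
```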
Tags: Add Flexible Labels for Categorization
Tags are lightweight, free-form labels you can attach to traces and later use to filter or group them.
Trace IDs: Correlate Events Across Distributed Services
A single user request often touches several services:
- API gateway (validates request)
- Backend service (calls LLM)
- RAG service (retrieves documents)
- Database (logs results)
To correlate them:
- Generate a trace ID when the request enters your system
- Pass the trace ID to every downstream service (via HTTP headers, message queues, etc.)
- Each service logs events with the same trace ID
- ABV groups all events by trace ID for end-to-end visibility
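A minimal sketch of the propagation pattern above, using an illustrative header name (your tracing SDK may manage propagation for you; W3C tracing uses the traceparent header):

```python
import uuid

import requests  # downstream HTTP call used for illustration

TRACE_HEADER = "X-Trace-Id"  # illustrative header name

def handle_incoming(headers: dict) -> dict:
    # Reuse the caller's trace ID if present, otherwise start a new one at the edge.
    trace_id = headers.get(TRACE_HEADER, uuid.uuid4().hex)

    # Pass the same ID to every downstream service so all events share one trace.
    requests.post(
        "https://rag-service.internal/search",  # illustrative downstream URL
        json={"query": "refund policy"},
        headers={TRACE_HEADER: trace_id},
        timeout=5,
    )
    return {"trace_id": trace_id}
```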
Multi-Modality: Store Text, Images, Audio, and Files
- Text: Standard messages, prompts, completions
- Images: Base64-encoded or URL-referenced images for GPT-4 Vision, Claude 4, Gemini Pro Vision
- Audio: Transcriptions (Whisper), voice inputs, TTS outputs
- Files: PDFs, CSVs, JSON, spreadsheets, code files
In the trace viewer:
- Images render inline for visual inspection
- Audio files play directly in the trace viewer
- Files are downloadable for local analysis
- JSON/structured outputs are syntax-highlighted
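As an example of how image inputs are usually passed, here is a sketch that base64-encodes a local file for a vision-capable chat completion (the model name and file path are placeholders); a tracing SDK can then store the data URL alongside the text:

```python
import base64
from openai import OpenAI

client = OpenAI()

# Encode a local image as a base64 data URL (file path is a placeholder).
with open("receipt.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the total on this receipt?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```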
Advanced Tracing Features
Model Usage & Cost Tracking
Costs are calculated per call as:
- Input tokens × the provider's input price
- Output tokens × the provider's output price
- Adjusted for provider-specific pricing (cached tokens, batch discounts, etc.)
This gives you:
- Input/output token counts
- Total cost per trace (in USD)
- Cost breakdowns by model, user, session, environment, metadata
- Cost trends over time (daily, weekly, monthly aggregations)
Typical questions you can answer:
- "Which users cost the most to serve?"
- "How much did feature X cost last month?"
- "Is GPT-4 or Claude cheaper for summarization?"
- "Did costs increase after deploying version 2.0?"
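As a rough illustration of the calculation above (the per-million-token prices are placeholders, not current provider pricing):

```python
# Illustrative per-million-token prices in USD; not current provider pricing.
PRICES = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    # input tokens x input price + output tokens x output price
    p = PRICES[model]
    return (input_tokens / 1_000_000) * p["input"] + (output_tokens / 1_000_000) * p["output"]

# Example: 12,000 input tokens and 1,500 output tokens.
print(f"${call_cost('gpt-4o-mini', 12_000, 1_500):.6f}")
```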
Masking Sensitive Data (PII, Secrets, Prompts)
Traces often contain data you don't want stored:
- PII: Emails, phone numbers, social security numbers, addresses
- Secrets: API keys, passwords, tokens
- Proprietary prompts: System instructions you don't want logged
Masking replaces these values before they leave your application (e.g., with [REDACTED]), which supports common compliance requirements:
- GDPR: Mask PII or exclude fields containing personal data
- HIPAA: Redact PHI (protected health information)
- SOC 2: Demonstrate data handling controls with audit logs
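One common implementation is a masking callback applied to every payload before it is sent. A regex-based sketch, assuming the SDK accepts such a hook at init (the mask parameter is an assumption):

```python
import re

import abv  # hypothetical ABV SDK; the mask-callback hook is an assumption

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
API_KEY = re.compile(r"sk-[A-Za-z0-9]{16,}")  # example secret pattern

def mask(value):
    # Redact emails and API-key-like strings from any string field before ingestion.
    if isinstance(value, str):
        value = EMAIL.sub("[REDACTED]", value)
        value = API_KEY.sub("[REDACTED]", value)
    return value

abv.init(api_key="...", mask=mask)  # assumed: the SDK applies mask() to all inputs/outputs
```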
Log Levels: Filter Noise by Severity
- DEBUG: Verbose internal details (tool calls, intermediate steps)
- INFO: Standard traces (successful LLM calls)
- WARNING: Degraded performance (slow responses, fallbacks)
- ERROR: Failures (API errors, timeouts, invalid outputs)
- Production: Set to WARNING or ERROR to reduce noise and focus on issues
- Development: Set to DEBUG for maximum visibility
- Cost optimization: Only log ERROR traces to minimize ingestion costs
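A sketch of choosing the minimum level per environment, assuming the SDK accepts it at init (the level parameter and env var names are assumptions):

```python
import os

import abv  # hypothetical ABV SDK; the "level" parameter is illustrative

# In production, only WARNING and ERROR traces are sent; in development, everything is.
min_level = "WARNING" if os.environ.get("ENV") == "production" else "DEBUG"
abv.init(api_key=os.environ["ABV_API_KEY"], level=min_level)
```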
Releases & Versioning: Track Code and Model Versions
- Compare performance before/after a deployment
- Identify regressions introduced by code changes
- Correlate quality drops with specific versions
- A/B test different versions in production
Common version formats:
- Semantic versioning: v1.2.3 (major.minor.patch)
- Git commit SHAs: abc1234 (unique per deployment)
- Timestamp-based: 2025-01-15-v2 (human-readable)
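A sketch of tagging traces with the deployed version, assuming the SDK accepts a release value at init and that CI/CD injects the version as an environment variable (both names are assumptions):

```python
import os

import abv  # hypothetical ABV SDK; the "release" parameter is illustrative

# GIT_SHA (or a semantic version) is assumed to be set by your CI/CD pipeline at deploy time.
abv.init(
    api_key=os.environ["ABV_API_KEY"],
    release=os.environ.get("GIT_SHA", "unknown"),
)
```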
Trace URLs: Share Deep Links for Reproducible Reports
Every trace has a stable URL, so you can paste a link into a bug report or chat and teammates see exactly the same interaction.
Sampling: Control Volume and Cost
- Start with 100% sampling until you understand your traffic patterns
- Always trace errors and slow requests (use rule-based sampling)
- Sample more aggressively in production (5-10%) than dev (100%)
- Exclude health checks and monitoring probes from tracing
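A sketch of the rule-based approach above: always keep errors and slow requests, and sample a fixed fraction of everything else (the decision function is generic; how you plug it into the SDK depends on its sampling hook):

```python
import random

SLOW_MS = 5_000     # anything slower than 5 s is always kept
SAMPLE_RATE = 0.10  # keep 10% of ordinary production traces

def should_keep(trace: dict) -> bool:
    # Always trace failures and tail-latency requests; sample everything else.
    if trace.get("error") or trace.get("duration_ms", 0) > SLOW_MS:
        return True
    return random.random() < SAMPLE_RATE

# Example: a fast, successful request is usually dropped under this policy.
print(should_keep({"error": None, "duration_ms": 420}))
```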
Comments on Objects: Collaborate on Traces
Use cases:
- Code reviews: "This prompt could be more concise; too many tokens"
- Bug reports: "@engineer This RAG retrieval pulled the wrong document"
- Customer support: "User reported this response as unhelpful; needs investigation"
- Evaluations: "Mark this output as incorrect for the next benchmark run"
Features:
- Markdown support for formatting
- @mentions to notify teammates
- Thread replies for discussions
- Comments visible on trace, session, and evaluation pages
In the dashboard, click the comment icon on any trace to add notes. Teammates get notified and can reply inline. See Comments on Objects for collaboration workflows.