How Linking Prompts to Traces Works
Understanding the integration between prompt management and observability:

Fetch prompt from ABV

Your application fetches the prompt at runtime using the ABV SDK (see the sketch after this list). The prompt object contains:
- Prompt content (with variables)
- Version number
- Labels pointing to this version
- Config (model parameters, etc.)
- Metadata (name, type, etc.)
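A minimal sketch of the fetch step, assuming a hypothetical `abv` Python package with an `Abv` client and a `get_prompt` method; the prompt name `movie-critic` and all signatures are illustrative, not the confirmed ABV API:

```python
# Hypothetical client and method names; adjust to the actual ABV SDK.
from abv import Abv  # assumed package and client name

abv = Abv()  # assumed to read ABV credentials and host from the environment

# Fetch the version currently carrying the "production" label (assumed signature).
prompt = abv.get_prompt("movie-critic", label="production")

print(prompt.version)  # version number, e.g. 3
print(prompt.labels)   # labels pointing to this version, e.g. ["production"]
print(prompt.config)   # stored model parameters, e.g. {"model": "gpt-4o", "temperature": 0.7}
```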
Compile prompt with variables
Fill in variables to create the actual prompt sent to the LLM. The compiled prompt is the actual text sent to the model, while the original prompt object retains the metadata needed for linking (see the sketch below).
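Continuing the sketch above, compiling fills in the template variables; the `compile` method and the variable name are assumed for illustration:

```python
# Fill in template variables; the returned string is what gets sent to the LLM.
compiled_prompt = prompt.compile(movie="Dune: Part Two")  # assumed method name

# `compiled_prompt` is plain text for the model call, while `prompt` still
# carries the name, version, and labels needed for linking.
```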
Link prompt to generation span

When creating the LLM generation span, pass the prompt object alongside it, using either a context manager or a decorator. ABV extracts the prompt name, version, and labels and associates them with the generation span, as in the sketch below.
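A sketch of the linking step using a context manager, continuing the code above and assuming the ABV SDK exposes a `start_generation`-style API that accepts the prompt object (the decorator variant appears in the Python SDK example further down); the OpenAI client is only an example model backend:

```python
from openai import OpenAI

client = OpenAI()

# Assumed context-manager API: the generation span records model input/output,
# and passing `prompt` links the span to the fetched prompt version.
with abv.start_generation(
    name="movie-recommendation",
    model=prompt.config.get("model", "gpt-4o"),
    input=compiled_prompt,
    prompt=prompt,  # creates the prompt-to-trace link
) as generation:
    response = client.chat.completions.create(
        model=prompt.config.get("model", "gpt-4o"),
        messages=[{"role": "user", "content": compiled_prompt}],
    )
    generation.update(output=response.choices[0].message.content)
```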
Automatic metric aggregation

Once linked, ABV automatically aggregates metrics by prompt version:
- Quality scores: Average scores grouped by prompt version
- Latency: Median, p95, p99 latency by prompt version
- Token usage: Input tokens, output tokens, total tokens
- Costs: Calculated from token usage and model pricing
- Volume: Count of generations per prompt version
- Timestamps: First and last generation for each version
Compare prompt versions
Use metrics to compare prompt versions:
- Side-by-side comparison: Select two versions to compare quality, latency, and costs
- Time series charts: See how metrics evolved across prompt deployments
- Regression detection: Identify when a new prompt version degraded performance
- A/B test analysis: Compare concurrent versions running in A/B tests
Implementation by SDK
Complete integration examples for Python and JavaScript/TypeScript:
Python SDK
Install the dependencies, then link prompts either with decorators (recommended for simplicity) or with context managers (recommended for more control). A combined sketch of both styles follows.
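The following is a hedged sketch only: the `abv` package, the `Abv` client, the `@observe` decorator, `update_current_generation`, and `start_generation` are assumed names standing in for whatever the ABV Python SDK actually provides, and the OpenAI client is just an example backend:

```python
# pip install abv openai   (package names assumed)
from abv import Abv, observe  # assumed client and decorator names
from openai import OpenAI

abv = Abv()
client = OpenAI()

# --- Using decorators (recommended for simplicity) ---
@observe(as_type="generation")  # assumed decorator that opens a generation span
def recommend_movie(genre: str) -> str:
    prompt = abv.get_prompt("movie-critic", label="production")  # assumed signature
    compiled = prompt.compile(genre=genre)

    # Assumed helper that attaches the prompt object to the current generation
    # span, which is what creates the prompt-to-trace link.
    abv.update_current_generation(prompt=prompt)

    response = client.chat.completions.create(
        model=prompt.config.get("model", "gpt-4o"),
        messages=[{"role": "user", "content": compiled}],
    )
    return response.choices[0].message.content

# --- Using context managers (recommended for more control) ---
def recommend_movie_ctx(genre: str) -> str:
    prompt = abv.get_prompt("movie-critic", label="production")
    compiled = prompt.compile(genre=genre)

    with abv.start_generation(
        name="recommend-movie",
        input=compiled,
        prompt=prompt,  # links this generation to the prompt version
    ) as generation:
        response = client.chat.completions.create(
            model=prompt.config.get("model", "gpt-4o"),
            messages=[{"role": "user", "content": compiled}],
        )
        output = response.choices[0].message.content
        generation.update(output=output)
        return output
```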
If a fallback prompt is used (when ABV is unavailable), the application keeps working but no prompt-to-trace link is created.
JavaScript/TypeScript SDK
Setup steps:
- Install dependencies
- Set up environment variables (.env file)
- Create an instrumentation file (instrumentation.ts)

Prompts can then be linked using manual observations, a context manager, or the observe wrapper.

If a fallback prompt is used, no link will be created.
Metrics Available by Prompt Version
Once prompts are linked to traces, ABV tracks the following metrics by prompt version:
Quality Metrics
Score aggregation: ABV aggregates all score types by prompt version:
- User feedback scores: Thumbs up/down, ratings, satisfaction surveys
- Model-based scores: Automated evaluation scores (relevance, correctness, safety)
- Human-in-the-loop scores: Expert annotations on sampled traces
- Custom scores: Application-specific quality metrics
Aggregations per prompt version (illustrated in the sketch below):
- Median score value per prompt version
- Score distribution (min, max, percentiles)
- Score trend over time for each version
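As an illustration of this aggregation, the following self-contained snippet groups scores by prompt version and computes the median and distribution with the Python standard library (the data is made up; ABV computes these server-side):

```python
from statistics import median, quantiles

# Toy data: (prompt_version, score) pairs as they might arrive from scored traces.
scores = [(2, 0.80), (2, 0.60), (2, 0.90), (3, 0.95), (3, 0.85), (3, 0.90)]

by_version: dict[int, list[float]] = {}
for version, value in scores:
    by_version.setdefault(version, []).append(value)

for version, values in sorted(by_version.items()):
    percentiles = quantiles(values, n=100)  # cut points for percentiles 1..99
    print(
        f"v{version}: median={median(values):.2f} "
        f"min={min(values):.2f} max={max(values):.2f} p95={percentiles[94]:.2f}"
    )
```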
This lets you:
- Compare quality between prompt versions: “Did version 3 improve scores vs. version 2?”
- Identify regressions: “Version 5 has lower quality scores than version 4”
- Validate A/B test winners: “Variant A has statistically higher scores than variant B”
Performance Metrics
Latency tracking: ABV measures generation latency by prompt version:
- Median generation latency: Typical response time
- p95/p99 latency: Tail latency for worst-case analysis
- Time-to-first-token: For streaming responses
- Generation count: Volume of requests per version
Token usage:
- Median input tokens per generation
- Median output tokens per generation
- Total tokens consumed by prompt version
Use these metrics to:
- Identify slow prompts: “Version 4 has 50ms higher latency than version 3”
- Optimize token usage: “This prompt variation uses 30% fewer tokens”
- Track performance trends: “Latency increased after deploying version 5”
Cost Metrics
Cost calculation: ABV calculates costs by prompt version based on token usage and model pricing:
- Median generation cost (per request)
- Total cost by prompt version
- Cost breakdown: Input tokens vs. output tokens
- Cost trends over time
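A simplified version of the cost calculation described above, with made-up per-1K-token prices (real pricing comes from the model provider and the pricing configured in ABV):

```python
# Illustrative per-1K-token prices; not real pricing.
INPUT_PRICE_PER_1K = 0.005
OUTPUT_PRICE_PER_1K = 0.015

def generation_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one generation: input and output tokens priced separately."""
    return (
        input_tokens / 1000 * INPUT_PRICE_PER_1K
        + output_tokens / 1000 * OUTPUT_PRICE_PER_1K
    )

# Example: 1,200 input tokens and 300 output tokens.
print(f"${generation_cost(1200, 300):.4f}")  # $0.0105
```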
This supports:
- Cost optimization: “Version 3 costs 20% less than version 2 due to shorter prompts”
- Budget tracking: “This prompt version costs $500/day in production”
- ROI analysis: “Higher quality version costs $100/day more but reduces support tickets”
Temporal Metrics
Timestamp tracking: ABV records when each prompt version was used:
- First generation timestamp for version
- Last generation timestamp for version
- Time series: Generations per day/hour/minute
This helps with:
- Deployment tracking: “Version 4 went live at 2pm yesterday”
- Adoption analysis: “Version 3 still receiving 10% of traffic due to caching”
- Incident correlation: “Quality degradation started at 3pm when version 5 deployed”
Using Metrics for Prompt Optimization
Practical workflows leveraging prompt-to-trace metrics:
Validating Prompt Improvements
Scenario: You improved a prompt and want to verify it actually performs better in production.

Workflow:
- Baseline: Current version (v2) in production with historical metrics
- Deploy new version: Create v3 and deploy it to production by assigning the production label
- Collect data: Run v3 for 24-48 hours to accumulate sufficient samples
- Compare metrics: Navigate to prompt in ABV dashboard, compare v2 vs. v3
- Quality scores: Did median score improve?
- Latency: Did response time change?
- Costs: Did token usage increase or decrease?
- Decision:
- If v3 improves quality without degrading latency/cost: Keep v3
- If v3 degrades quality or increases cost too much: Roll back to v2
- If results are mixed: Run longer A/B test for statistical significance
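If promotion and rollback are handled through labels, the decision step might look like the following sketch; `update_prompt_labels` is a hypothetical method name standing in for however the ABV SDK actually reassigns labels:

```python
from abv import Abv  # assumed package and client name

abv = Abv()

# Keep v3: point the "production" label at version 3 (assumed call).
abv.update_prompt_labels(name="movie-critic", version=3, labels=["production"])

# Roll back: point the "production" label back at version 2.
# abv.update_prompt_labels(name="movie-critic", version=2, labels=["production"])
```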
Debugging Quality Regressions
Scenario: Quality metrics dropped after a recent deployment. You need to identify which prompt change caused the issue.

Workflow:
- Identify regression window: Check metrics dashboard to see when scores dropped
- Review prompt history: View prompt versions deployed during that time period
- Compare versions: Use diff view to see what changed between versions
- Correlate with metrics: Match deployment timestamps with metric changes
- Reproduce issue: Fetch the suspect version and test it locally (see the sketch after this list)
- Root cause analysis: Identify specific prompt change that caused regression
- Fix and redeploy: Create new version with fix, validate in staging, deploy
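A sketch of the local reproduction step, assuming `get_prompt` also accepts a specific version number (names and signatures are illustrative):

```python
from abv import Abv  # assumed package and client name

abv = Abv()

# Fetch the exact version suspected of causing the regression (assumed signature).
suspect = abv.get_prompt("movie-critic", version=5)

# Inspect the exact text the model received for a known-problematic input,
# then compare the output against the previous version.
compiled = suspect.compile(genre="sci-fi")
print(compiled)
```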
A/B Testing Prompt Variants
Scenario: You have two prompt variants and want to determine which performs better.

Workflow:
- Create variants:
  - v2: Variant A, assigned the variant-a label
  - v3: Variant B, assigned the variant-b label
- Implement randomization: Route each request to one of the two variants at random (see the sketch after this list)
- Collect data: Run for days/weeks to achieve statistical power
- Analyze results: Compare metrics by prompt version:
- Quality: v2 median score 4.2/5, v3 median score 4.5/5
- Latency: v2 median 450ms, v3 median 480ms (slightly slower)
- Cost: v2 median $0.004 (20% more expensive)
- Calculate significance: Use statistical tests to validate results
- Promote winner: Reassign the production label to the better variant
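A minimal sketch of the randomization step, reusing the assumed `get_prompt` signature and the variant labels from the workflow above:

```python
import random

from abv import Abv  # assumed package and client name

abv = Abv()

def get_ab_test_prompt():
    """Route roughly 50% of requests to each variant by fetching it via its label."""
    label = random.choice(["variant-a", "variant-b"])
    return abv.get_prompt("movie-critic", label=label)  # assumed signature

prompt = get_ab_test_prompt()
# Because the fetched prompt object is passed to the generation span,
# metrics aggregate per variant/version automatically in ABV.
```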
Monitoring Production Prompts
Scenario: Set up continuous monitoring of production prompts to detect issues early.

Setup:
- Link all prompts to traces: Ensure all generation spans include prompt metadata
- Configure dashboards: Create custom dashboards showing:
- Quality trends over time for production prompt version
- Latency p95/p99 for production version
- Cost per day for production version
- Volume (generations/day) for production version
- Set up alerts: Configure alerts for:
- Quality score drops below threshold
- Latency increases above threshold
- Cost per generation exceeds budget
- Generation volume spikes or drops unexpectedly
- Regular review: Weekly review of prompt metrics to identify optimization opportunities
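If you pull these metrics programmatically for custom alerting, a simple threshold check could look like this sketch; `fetch_prompt_metrics` is a placeholder for whatever export or API ABV provides, and the numbers are illustrative:

```python
def fetch_prompt_metrics(prompt_name: str, label: str) -> dict:
    # Placeholder: substitute the actual ABV metrics export or API here.
    return {"median_score": 0.82, "p95_latency_ms": 1200, "cost_per_generation": 0.004}

THRESHOLDS = {"median_score": 0.75, "p95_latency_ms": 2000, "cost_per_generation": 0.01}

metrics = fetch_prompt_metrics("movie-critic", label="production")

if metrics["median_score"] < THRESHOLDS["median_score"]:
    print("ALERT: quality score below threshold")
if metrics["p95_latency_ms"] > THRESHOLDS["p95_latency_ms"]:
    print("ALERT: p95 latency above threshold")
if metrics["cost_per_generation"] > THRESHOLDS["cost_per_generation"]:
    print("ALERT: per-generation cost above budget")
```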
Metrics Reference
Complete list of metrics tracked when prompts are linked to traces:

| Metric | Description | Aggregation |
|---|---|---|
| Median generation latency | Median time from generation start to end | Median across all generations |
| Median input tokens | Median count of input tokens | Median across all generations |
| Median output tokens | Median count of output tokens | Median across all generations |
| Median generation cost | Median cost per generation (input + output tokens) | Median across all generations |
| Generation count | Total number of generations using this prompt version | Sum |
| Median score value | Median score across all score types (user, model, human) | Median across all scores |
| First generation timestamp | When this prompt version was first used | Earliest timestamp |
| Last generation timestamp | When this prompt version was most recently used | Latest timestamp |
| Quality trend | Change in median score over time | Time series |
| Cost trend | Change in median cost over time | Time series |
| Latency trend | Change in median latency over time | Time series |
Next Steps
- Get Started with Prompts: Create your first prompt and link it to traces
- Version Control: Manage prompt versions and labels for deployment
- A/B Testing: Compare prompt versions with A/B testing workflows
- Scores Data Model: Understand score types and how they aggregate by prompt version
- Observability & Tracing: Learn more about generation spans and observability instrumentation
- Metrics Dashboard: Explore metrics beyond prompt-specific tracking