Linking prompts to traces lets you track which prompt version generated each LLM response. This connection enables metrics by prompt version, comparison across versions, and data-driven iteration, turning prompt management from guesswork into systematic optimization.

How Linking Prompts to Traces Works

Understanding the integration between prompt management and observability:

Fetch prompt from ABV

Your application fetches the prompt at runtime using the ABV SDK:
prompt = abv.get_prompt("movie-critic")  # Fetches production version
The prompt object contains:
  • Prompt content (with variables)
  • Version number
  • Labels pointing to this version
  • Config (model parameters, etc.)
  • Metadata (name, type, etc.)
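Exact attribute names vary by SDK version; the field names in this quick sketch (prompt.prompt, prompt.version, prompt.labels, prompt.config) are illustrative assumptions rather than confirmed API:
prompt = abv.get_prompt("movie-critic")

# Illustrative field access; check the SDK reference for the actual attribute names
print(prompt.prompt)   # Template text, e.g. "As an {{criticlevel}} movie critic, do you like {{movie}}?"
print(prompt.version)  # Version number, e.g. 3
print(prompt.labels)   # Labels pointing to this version, e.g. ["production"]
print(prompt.config)   # Model parameters, e.g. {"model": "gpt-4o", "temperature": 0.7}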

Compile prompt with variables

Fill in variables to create the actual prompt sent to the LLM:
compiled_prompt = prompt.compile(
    criticlevel="expert",
    movie="Dune 2"
)
# Result: "As an expert movie critic, do you like Dune 2?"
The compiled prompt is the actual text sent to the LLM, while the original prompt object retains metadata for linking.

Link prompt to generation span

When creating the LLM generation span, pass the prompt object.
Using decorators:
@observe(as_type="generation")
def call_llm():
    prompt = abv.get_prompt("movie-critic")
    abv.update_current_generation(prompt=prompt)  # Link prompt
    # ... make LLM call ...
Using context managers:
with abv.start_as_current_observation(
    as_type='generation',
    name="movie-generation",
    model="gpt-4o",
    prompt=prompt  # Link prompt to this generation
) as generation:
    # ... make LLM call ...
    generation.update(output=response)
ABV extracts prompt name, version, and labels, associating them with the generation span.

Automatic metric aggregation

Once linked, ABV automatically aggregates metrics by prompt version:
  • Quality scores: Average scores grouped by prompt version
  • Latency: Median, p95, p99 latency by prompt version
  • Token usage: Input tokens, output tokens, total tokens
  • Costs: Calculated from token usage and model pricing
  • Volume: Count of generations per prompt version
  • Timestamps: First and last generation for each version
Access metrics: Navigate to the prompt in the ABV dashboard and click the Metrics tab to view aggregated performance by version.

Compare prompt versions

Use metrics to compare prompt versions:
  • Side-by-side comparison: Select two versions to compare quality, latency, and costs
  • Time series charts: See how metrics evolved across prompt deployments
  • Regression detection: Identify when a new prompt version degraded performance
  • A/B test analysis: Compare concurrent versions running in A/B tests
Data-driven decisions: Promote versions that improve quality, roll back versions that degrade performance, and iterate based on measurable outcomes.

Implementation by SDK

Complete integration examples for Python and JavaScript/TypeScript:
Install dependencies:
pip install abvdev openai
Using decorators (recommended for simplicity):
from abvdev import ABV, observe

abv = ABV(
    api_key="sk-abv-...",
    host="https://app.abv.dev",
)

@observe(as_type="generation")
def call_movie_critic():
    # Fetch prompt
    prompt = abv.get_prompt("movie-critic")

    # Link prompt to current generation
    abv.update_current_generation(prompt=prompt)

    # Compile prompt with variables
    compiled_prompt = prompt.compile(
        criticlevel="expert",
        movie="Dune 2"
    )

    # Make LLM call (example with OpenAI)
    from openai import OpenAI
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": compiled_prompt}]
    )

    return response.choices[0].message.content

@observe()
def main():
    result = call_movie_critic()
    print(result)

main()
Using context managers (recommended for more control):
from abvdev import ABV
from openai import OpenAI

abv = ABV(
    api_key="sk-abv-...",
    host="https://app.abv.dev",
)

openai_client = OpenAI(api_key="sk-proj-...")

# Fetch prompt
prompt = abv.get_prompt("movie-critic")

# Compile prompt
compiled_prompt = prompt.compile(
    criticlevel="expert",
    movie="The Lord of the Rings"
)

# Create generation span with linked prompt
with abv.start_as_current_observation(
    as_type='generation',
    name="movie-generation",
    model="gpt-4o",
    prompt=prompt  # Link prompt here
) as generation:
    # Make LLM call
    response = openai_client.chat.completions.create(
        messages=[{"role": "user", "content": compiled_prompt}],
        model="gpt-4o",
    )

    # Update generation with output
    generation.update(output=response.choices[0].message.content)
If a fallback prompt is used (when ABV is unavailable), no link will be created to preserve application reliability.
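This page does not show the SDK's fallback mechanism itself, but the same behavior can be sketched at the application level with a plain try/except (the hard-coded fallback string below is illustrative):
# Assumes this runs inside an @observe(as_type="generation") function
FALLBACK_PROMPT = "As an expert movie critic, do you like {movie}?"  # Local default, illustrative

try:
    prompt = abv.get_prompt("movie-critic")
    compiled_prompt = prompt.compile(criticlevel="expert", movie="Dune 2")
    abv.update_current_generation(prompt=prompt)  # Link only when the managed prompt was fetched
except Exception:
    # ABV unavailable: use the local default and skip linking to keep the app running
    compiled_prompt = FALLBACK_PROMPT.format(movie="Dune 2")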
Install dependencies:
npm install @abvdev/client @abvdev/tracing @abvdev/otel @opentelemetry/sdk-node dotenv
Set up environment variables (.env file):
ABV_API_KEY=sk-abv-...
ABV_BASE_URL=https://app.abv.dev  # US region
# ABV_BASE_URL=https://eu.app.abv.dev  # EU region
Create instrumentation file (instrumentation.ts):
import dotenv from "dotenv";
dotenv.config();

import { NodeSDK } from "@opentelemetry/sdk-node";
import { ABVSpanProcessor } from "@abvdev/otel";

const sdk = new NodeSDK({
  spanProcessors: [
    new ABVSpanProcessor({
      apiKey: process.env.ABV_API_KEY,
      baseUrl: process.env.ABV_BASE_URL,
      exportMode: "immediate",
      flushAt: 1,
      flushInterval: 1,
      additionalHeaders: {
        "Content-Type": "application/json",
        "Accept": "application/json"
      }
    })
  ],
});

sdk.start();
Using manual observations:
import "./instrumentation"; // Must be the first import
import { ABVClient } from "@abvdev/client";
import { startObservation } from "@abvdev/tracing";

const abv = new ABVClient();

async function main() {
  // Fetch prompt
  const prompt = await abv.prompt.get("movie-critic");

  // Create generation span
  const generation = startObservation(
    "llm",
    {
      input: prompt.prompt,  // Include prompt in span
    },
    { asType: "generation" },
  );

  // Your LLM call here
  // ...

  generation.end();
}

main();
Using context manager:
import "./instrumentation";
import { ABVClient } from "@abvdev/client";
import { startActiveObservation } from "@abvdev/tracing";

const abv = new ABVClient();

startActiveObservation(
  "llm",
  async (generation) => {
    // Fetch prompt
    const prompt = await abv.prompt.get("movie-critic");

    // Link prompt to generation
    generation.update({ input: prompt.prompt });

    // Make LLM call
    // ...
  },
  { asType: "generation" },
);
Using observe wrapper:
import "./instrumentation";
import { ABVClient } from "@abvdev/client";
import { observe, updateActiveObservation } from "@abvdev/tracing";

const abv = new ABVClient();

const callLLM = async (input: string) => {
  // Fetch prompt
  const prompt = await abv.prompt.get("my-prompt");

  // Link prompt to current generation
  updateActiveObservation({ prompt }, { asType: "generation" });

  // Make LLM call
  return await invokeLLM(input);
};

export const observedCallLLM = observe(callLLM);
If a fallback prompt is used, no link will be created.

Metrics Available by Prompt Version

Once prompts are linked to traces, ABV tracks the following metrics by prompt version:
Score aggregation: ABV aggregates all score types by prompt version:
  • User feedback scores: Thumbs up/down, ratings, satisfaction surveys
  • Model-based scores: Automated evaluation scores (relevance, correctness, safety)
  • Human-in-the-loop scores: Expert annotations on sampled traces
  • Custom scores: Application-specific quality metrics
Aggregations:
  • Median score value per prompt version
  • Score distribution (min, max, percentiles)
  • Score trend over time for each version
Use cases:
  • Compare quality between prompt versions: “Did version 3 improve scores vs. version 2?”
  • Identify regressions: “Version 5 has lower quality scores than version 4”
  • Validate A/B test winners: “Variant A has statistically higher scores than variant B”
Learn more about scores →
Latency tracking: ABV measures generation latency by prompt version (a percentile sketch follows this list):
  • Median generation latency: Typical response time
  • p95/p99 latency: Tail latency for worst-case analysis
  • Time-to-first-token: For streaming responses
  • Generation count: Volume of requests per version
Token usage:
  • Median input tokens per generation
  • Median output tokens per generation
  • Total tokens consumed by prompt version
Use cases:
  • Identify slow prompts: “Version 4 has 50ms higher latency than version 3”
  • Optimize token usage: “This prompt variation uses 30% fewer tokens”
  • Track performance trends: “Latency increased after deploying version 5”
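To make the percentile metrics concrete, this is roughly the aggregation ABV performs on its side, sketched locally over example latency values (not ABV code):
import statistics

latencies_ms = [420, 450, 445, 480, 1200, 460, 455, 470, 900, 440]  # Example per-generation latencies

median_ms = statistics.median(latencies_ms)
p95_ms = statistics.quantiles(latencies_ms, n=100)[94]  # 95th percentile
p99_ms = statistics.quantiles(latencies_ms, n=100)[98]  # 99th percentile

print(f"median={median_ms}ms p95={p95_ms:.0f}ms p99={p99_ms:.0f}ms")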
Cost calculation: ABV calculates costs by prompt version based on token usage and model pricing (a cost-arithmetic sketch follows this list):
  • Median generation cost (per request)
  • Total cost by prompt version
  • Cost breakdown: Input tokens vs. output tokens
  • Cost trends over time
Use cases:
  • Cost optimization: “Version 3 costs 20% less than version 2 due to shorter prompts”
  • Budget tracking: “This prompt version costs $500/day in production”
  • ROI analysis: “Higher quality version costs $100/day more but reduces support tickets”
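The underlying arithmetic is simple; the per-token prices in this sketch are placeholders, not actual model pricing:
# Placeholder prices in USD per token; substitute your model's actual pricing
INPUT_PRICE_PER_TOKEN = 2.50 / 1_000_000    # e.g. $2.50 per 1M input tokens
OUTPUT_PRICE_PER_TOKEN = 10.00 / 1_000_000  # e.g. $10.00 per 1M output tokens

input_tokens = 850
output_tokens = 240

generation_cost = input_tokens * INPUT_PRICE_PER_TOKEN + output_tokens * OUTPUT_PRICE_PER_TOKEN
print(f"Cost per generation: ${generation_cost:.6f}")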
Timestamp tracking: ABV records when each prompt version was used:
  • First generation timestamp for version
  • Last generation timestamp for version
  • Time series: Generations per day/hour/minute
Use cases:
  • Deployment tracking: “Version 4 went live at 2pm yesterday”
  • Adoption analysis: “Version 3 still receiving 10% of traffic due to caching”
  • Incident correlation: “Quality degradation started at 3pm when version 5 deployed”

Using Metrics for Prompt Optimization

Practical workflows leveraging prompt-to-trace metrics:
Scenario: You improved a prompt and want to verify it actually performs better in production.
Workflow:
  1. Baseline: Current version (v2) in production with historical metrics
  2. Deploy new version: Create v3, deploy to production with production label
  3. Collect data: Run v3 for 24-48 hours to accumulate sufficient samples
  4. Compare metrics: Navigate to prompt in ABV dashboard, compare v2 vs. v3
    • Quality scores: Did median score improve?
    • Latency: Did response time change?
    • Costs: Did token usage increase or decrease?
  5. Decision:
    • If v3 improves quality without degrading latency/cost: Keep v3
    • If v3 degrades quality or increases cost too much: Roll back to v2
    • If results are mixed: Run longer A/B test for statistical significance
Benefits: Objective validation rather than subjective assessment, data-driven decisions.
Scenario: Quality metrics dropped after a recent deployment. You need to identify which prompt change caused the issue.
Workflow:
  1. Identify regression window: Check metrics dashboard to see when scores dropped
  2. Review prompt history: View prompt versions deployed during that time period
  3. Compare versions: Use diff view to see what changed between versions
  4. Correlate with metrics: Match deployment timestamps with metric changes
  5. Reproduce issue: Fetch the suspect version and test locally:
    suspect_version = abv.get_prompt("movie-critic", version=5)
    good_version = abv.get_prompt("movie-critic", version=4)
    # Compare outputs for the same inputs (see the fuller sketch after this workflow)
    
  6. Root cause analysis: Identify specific prompt change that caused regression
  7. Fix and redeploy: Create new version with fix, validate in staging, deploy
Benefits: Fast incident resolution, clear audit trail, reproducible debugging.
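Expanding step 5 into a runnable comparison, reusing the SDK and OpenAI calls already shown on this page (the test input is just an example):
from openai import OpenAI

client = OpenAI()

suspect_version = abv.get_prompt("movie-critic", version=5)
good_version = abv.get_prompt("movie-critic", version=4)

test_input = {"criticlevel": "expert", "movie": "Dune 2"}  # Same input for both versions

for label, candidate in [("v5 (suspect)", suspect_version), ("v4 (good)", good_version)]:
    compiled = candidate.compile(**test_input)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": compiled}],
    )
    print(f"--- {label} ---")
    print(response.choices[0].message.content)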
Scenario: You have two prompt variants and want to determine which performs better.
Workflow:
  1. Create variants:
    • v2: Variant A, assign variant-a label
    • v3: Variant B, assign variant-b label
  2. Implement randomization:
    import random
    variant = random.choice(["variant-a", "variant-b"])
    prompt = abv.get_prompt("movie-critic", label=variant)
    abv.update_current_generation(prompt=prompt)  # Link for tracking
    
  3. Collect data: Run for days/weeks to achieve statistical power
  4. Analyze results: Compare metrics by prompt version:
    • Quality: v2 median score 4.2/5, v3 median score 4.5/5
    • Latency: v2 median 450ms, v3 median 480ms (slightly slower)
    • Cost: v2 median $0.003, v3 median $0.004 (about 33% more expensive)
  5. Calculate significance: Use statistical tests to validate results (see the sketch after this workflow)
  6. Promote winner: Reassign production label to better variant
Learn more about A/B testing →
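For step 5, a two-sample test over per-generation scores gives a first-pass significance check; scipy is an extra dependency, and exporting the scores from ABV (e.g. via the dashboard) is assumed to happen beforehand:
from scipy import stats

# Example per-generation quality scores exported for each variant
scores_variant_a = [4.0, 4.5, 4.0, 4.5, 4.0, 4.5, 4.0, 4.0, 4.5, 4.0]
scores_variant_b = [4.5, 5.0, 4.5, 4.5, 5.0, 4.5, 4.0, 4.5, 5.0, 4.5]

# Welch's t-test: does not assume equal variances between variants
t_stat, p_value = stats.ttest_ind(scores_variant_a, scores_variant_b, equal_var=False)

if p_value < 0.05:
    print(f"Statistically significant difference (p={p_value:.4f})")
else:
    print(f"Not significant yet (p={p_value:.4f}); keep collecting data")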
Scenario: Set up continuous monitoring of production prompts to detect issues early.
Setup:
  1. Link all prompts to traces: Ensure all generation spans include prompt metadata (a helper sketch follows this list)
  2. Configure dashboards: Create custom dashboards showing:
    • Quality trends over time for production prompt version
    • Latency p95/p99 for production version
    • Cost per day for production version
    • Volume (generations/day) for production version
  3. Set up alerts: Configure alerts for:
    • Quality score drops below threshold
    • Latency increases above threshold
    • Cost per generation exceeds budget
    • Generation volume spikes or drops unexpectedly
  4. Regular review: Weekly review of prompt metrics to identify optimization opportunities
Benefits: Proactive issue detection, continuous optimization, cost control.
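For step 1, a small helper built from the SDK calls shown earlier keeps linking consistent across the codebase; the helper name and structure are just one way to organize it, and the LLM call itself is elided:
from abvdev import ABV, observe

abv = ABV(api_key="sk-abv-...", host="https://app.abv.dev")

@observe(as_type="generation")
def generate_with_linked_prompt(prompt_name: str, **variables):
    # Fetch the production version and link it to the current generation span
    prompt = abv.get_prompt(prompt_name)
    abv.update_current_generation(prompt=prompt)

    compiled_prompt = prompt.compile(**variables)

    # ... make the LLM call with compiled_prompt and update the generation output ...
    return compiled_prompt  # Placeholder return for the sketch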

Metrics Reference

Complete list of metrics tracked when prompts are linked to traces:
| Metric | Description | Aggregation |
| --- | --- | --- |
| Median generation latency | Median time from generation start to end | Median across all generations |
| Median input tokens | Median count of input tokens | Median across all generations |
| Median output tokens | Median count of output tokens | Median across all generations |
| Median generation cost | Median cost per generation (input + output tokens) | Median across all generations |
| Generation count | Total number of generations using this prompt version | Sum |
| Median score value | Median score across all score types (user, model, human) | Median across all scores |
| First generation timestamp | When this prompt version was first used | Earliest timestamp |
| Last generation timestamp | When this prompt version was most recently used | Latest timestamp |
| Quality trend | Change in median score over time | Time series |
| Cost trend | Change in median cost over time | Time series |
| Latency trend | Change in median latency over time | Time series |

Next Steps