Linking prompts to traces lets you track which prompt version generated each LLM response. This connection enables metrics by prompt version, comparison across versions, and data-driven iteration, turning prompt management from guesswork into systematic optimization.

How Linking Prompts to Traces Works

Understanding the integration between prompt management and observability:

Fetch prompt from ABV

Your application fetches the prompt at runtime using the ABV SDK:
prompt = abv.get_prompt("movie-critic")  # Fetches production version
The prompt object contains:
  • Prompt content (with variables)
  • Version number
  • Labels pointing to this version
  • Config (model parameters, etc.)
  • Metadata (name, type, etc.)
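Exact attribute names vary by SDK version; the field names in this quick sketch (prompt.prompt, prompt.version, prompt.labels, prompt.config) are illustrative assumptions rather than confirmed API:
prompt = abv.get_prompt("movie-critic")

# Illustrative field access; check the SDK reference for the actual attribute names
print(prompt.prompt)   # Template text, e.g. "As an {{criticlevel}} movie critic, do you like {{movie}}?"
print(prompt.version)  # Version number, e.g. 3
print(prompt.labels)   # Labels pointing to this version, e.g. ["production"]
print(prompt.config)   # Model parameters, e.g. {"model": "gpt-4o", "temperature": 0.7}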

Compile prompt with variables

Fill in variables to create the actual prompt sent to the LLM:
compiled_prompt = prompt.compile(
    criticlevel="expert",
    movie="Dune 2"
)
# Result: "As an expert movie critic, do you like Dune 2?"
The compiled prompt is the actual text sent to the LLM, while the original prompt object retains metadata for linking.

Link prompt to generation span

When creating the LLM generation span, pass the prompt object.
Using decorators:
@observe(as_type="generation")
def call_llm():
    prompt = abv.get_prompt("movie-critic")
    abv.update_current_generation(prompt=prompt)  # Link prompt
    # ... make LLM call ...
Using context managers:
with abv.start_as_current_observation(
    as_type='generation',
    name="movie-generation",
    model="gpt-4o",
    prompt=prompt  # Link prompt to this generation
) as generation:
    # ... make LLM call ...
    generation.update(output=response)
ABV extracts prompt name, version, and labels, associating them with the generation span.

Automatic metric aggregation

Once linked, ABV automatically aggregates metrics by prompt version:
  • Quality scores: Average scores grouped by prompt version
  • Latency: Median, p95, p99 latency by prompt version
  • Token usage: Input tokens, output tokens, total tokens
  • Costs: Calculated from token usage and model pricing
  • Volume: Count of generations per prompt version
  • Timestamps: First and last generation for each version
Access metrics: Navigate to the prompt in the ABV dashboard and click the Metrics tab to view aggregated performance by version.

Compare prompt versions

Use metrics to compare prompt versions:
  • Side-by-side comparison: Select two versions to compare quality, latency, and costs
  • Time series charts: See how metrics evolved across prompt deployments
  • Regression detection: Identify when a new prompt version degraded performance
  • A/B test analysis: Compare concurrent versions running in A/B tests
Data-driven decisions: Promote versions that improve quality, roll back versions that degrade performance, and iterate based on measurable outcomes.

Implementation by SDK

Complete integration examples for Python and JavaScript/TypeScript:
Install dependencies:
pip install abvdev openai
Using decorators (recommended for simplicity):
from abvdev import ABV, observe

abv = ABV(
    api_key="sk-abv-...",
    host="https://app.abv.dev",
)

@observe(as_type="generation")
def call_movie_critic():
    # Fetch prompt
    prompt = abv.get_prompt("movie-critic")

    # Link prompt to current generation
    abv.update_current_generation(prompt=prompt)

    # Compile prompt with variables
    compiled_prompt = prompt.compile(
        criticlevel="expert",
        movie="Dune 2"
    )

    # Make LLM call (example with OpenAI)
    from openai import OpenAI
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": compiled_prompt}]
    )

    return response.choices[0].message.content

@observe()
def main():
    result = call_movie_critic()
    print(result)

main()
Using context managers (recommended for more control):
from abvdev import ABV
from openai import OpenAI

abv = ABV(
    api_key="sk-abv-...",
    host="https://app.abv.dev",
)

openai_client = OpenAI(api_key="sk-proj-...")

# Fetch prompt
prompt = abv.get_prompt("movie-critic")

# Compile prompt
compiled_prompt = prompt.compile(
    criticlevel="expert",
    movie="The Lord of the Rings"
)

# Create generation span with linked prompt
with abv.start_as_current_observation(
    as_type='generation',
    name="movie-generation",
    model="gpt-4o",
    prompt=prompt  # Link prompt here
) as generation:
    # Make LLM call
    response = openai_client.chat.completions.create(
        messages=[{"role": "user", "content": compiled_prompt}],
        model="gpt-4o",
    )

    # Update generation with output
    generation.update(output=response.choices[0].message.content)
If a fallback prompt is used (when ABV is unavailable), no link will be created to preserve application reliability.
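This page does not show the SDK's fallback mechanism itself, but the same behavior can be sketched at the application level with a plain try/except (the hard-coded fallback string below is illustrative):
# Assumes this runs inside an @observe(as_type="generation") function
FALLBACK_PROMPT = "As an expert movie critic, do you like {movie}?"  # Local default, illustrative

try:
    prompt = abv.get_prompt("movie-critic")
    compiled_prompt = prompt.compile(criticlevel="expert", movie="Dune 2")
    abv.update_current_generation(prompt=prompt)  # Link only when the managed prompt was fetched
except Exception:
    # ABV unavailable: use the local default and skip linking to keep the app running
    compiled_prompt = FALLBACK_PROMPT.format(movie="Dune 2")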
Install dependencies:
npm install @abvdev/client @abvdev/tracing @abvdev/otel @opentelemetry/sdk-node dotenv
Set up environment variables (.env file):
ABV_API_KEY=sk-abv-...
ABV_BASE_URL=https://app.abv.dev  # US region
# ABV_BASE_URL=https://eu.app.abv.dev  # EU region
Create instrumentation file (instrumentation.ts):
import dotenv from "dotenv";
dotenv.config();

import { NodeSDK } from "@opentelemetry/sdk-node";
import { ABVSpanProcessor } from "@abvdev/otel";

const sdk = new NodeSDK({
  spanProcessors: [
    new ABVSpanProcessor({
      apiKey: process.env.ABV_API_KEY,
      baseUrl: process.env.ABV_BASE_URL,
      exportMode: "immediate",
      flushAt: 1,
      flushInterval: 1,
      additionalHeaders: {
        "Content-Type": "application/json",
        "Accept": "application/json"
      }
    })
  ],
});

sdk.start();
Using manual observations:
import "./instrumentation"; // Must be the first import
import { ABVClient } from "@abvdev/client";
import { startObservation } from "@abvdev/tracing";

const abv = new ABVClient();

async function main() {
  // Fetch prompt
  const prompt = await abv.prompt.get("movie-critic");

  // Create generation span
  const generation = startObservation(
    "llm",
    {
      input: prompt.prompt,  // Include prompt in span
    },
    { asType: "generation" },
  );

  // Your LLM call here
  // ...

  generation.end();
}

main();
Using context manager:
import "./instrumentation";
import { ABVClient } from "@abvdev/client";
import { startActiveObservation } from "@abvdev/tracing";

const abv = new ABVClient();

startActiveObservation(
  "llm",
  async (generation) => {
    // Fetch prompt
    const prompt = await abv.prompt.get("movie-critic");

    // Link prompt to generation
    generation.update({ input: prompt.prompt });

    // Make LLM call
    // ...
  },
  { asType: "generation" },
);
Using observe wrapper:
import "./instrumentation";
import { ABVClient } from "@abvdev/client";
import { observe, updateActiveObservation } from "@abvdev/tracing";

const abv = new ABVClient();

const callLLM = async (input: string) => {
  // Fetch prompt
  const prompt = await abv.prompt.get("my-prompt");

  // Link prompt to current generation
  updateActiveObservation({ prompt }, { asType: "generation" });

  // Make LLM call
  return await invokeLLM(input);
};

export const observedCallLLM = observe(callLLM);
If a fallback prompt is used, no link will be created.

Metrics Available by Prompt Version

Once prompts are linked to traces, ABV tracks the following metrics by prompt version:
Score aggregation: ABV aggregates all score types by prompt version:
  • User feedback scores: Thumbs up/down, ratings, satisfaction surveys
  • Model-based scores: Automated evaluation scores (relevance, correctness, safety)
  • Human-in-the-loop scores: Expert annotations on sampled traces
  • Custom scores: Application-specific quality metrics
Aggregations:
  • Median score value per prompt version
  • Score distribution (min, max, percentiles)
  • Score trend over time for each version
Use cases:
  • Compare quality between prompt versions: “Did version 3 improve scores vs. version 2?”
  • Identify regressions: “Version 5 has lower quality scores than version 4”
  • Validate A/B test winners: “Variant A has statistically higher scores than variant B”
Learn more about scores →
Latency tracking: ABV measures generation latency by prompt version (a percentile sketch follows this list):
  • Median generation latency: Typical response time
  • p95/p99 latency: Tail latency for worst-case analysis
  • Time-to-first-token: For streaming responses
  • Generation count: Volume of requests per version
Token usage:
  • Median input tokens per generation
  • Median output tokens per generation
  • Total tokens consumed by prompt version
Use cases:
  • Identify slow prompts: “Version 4 has 50ms higher latency than version 3”
  • Optimize token usage: “This prompt variation uses 30% fewer tokens”
  • Track performance trends: “Latency increased after deploying version 5”
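To make the percentile metrics concrete, this is roughly the aggregation ABV performs on its side, sketched locally over example latency values (not ABV code):
import statistics

latencies_ms = [420, 450, 445, 480, 1200, 460, 455, 470, 900, 440]  # Example per-generation latencies

median_ms = statistics.median(latencies_ms)
p95_ms = statistics.quantiles(latencies_ms, n=100)[94]  # 95th percentile
p99_ms = statistics.quantiles(latencies_ms, n=100)[98]  # 99th percentile

print(f"median={median_ms}ms p95={p95_ms:.0f}ms p99={p99_ms:.0f}ms")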
Cost calculation: ABV calculates costs by prompt version based on token usage and model pricing (a cost-arithmetic sketch follows this list):
  • Median generation cost (per request)
  • Total cost by prompt version
  • Cost breakdown: Input tokens vs. output tokens
  • Cost trends over time
Use cases:
  • Cost optimization: “Version 3 costs 20% less than version 2 due to shorter prompts”
  • Budget tracking: “This prompt version costs $500/day in production”
  • ROI analysis: “Higher quality version costs $100/day more but reduces support tickets”
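The underlying arithmetic is simple; the per-token prices in this sketch are placeholders, not actual model pricing:
# Placeholder prices in USD per token; substitute your model's actual pricing
INPUT_PRICE_PER_TOKEN = 2.50 / 1_000_000    # e.g. $2.50 per 1M input tokens
OUTPUT_PRICE_PER_TOKEN = 10.00 / 1_000_000  # e.g. $10.00 per 1M output tokens

input_tokens = 850
output_tokens = 240

generation_cost = input_tokens * INPUT_PRICE_PER_TOKEN + output_tokens * OUTPUT_PRICE_PER_TOKEN
print(f"Cost per generation: ${generation_cost:.6f}")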
Timestamp tracking: ABV records when each prompt version was used:
  • First generation timestamp for version
  • Last generation timestamp for version
  • Time series: Generations per day/hour/minute
Use cases:
  • Deployment tracking: “Version 4 went live at 2pm yesterday”
  • Adoption analysis: “Version 3 still receiving 10% of traffic due to caching”
  • Incident correlation: “Quality degradation started at 3pm when version 5 deployed”

Using Metrics for Prompt Optimization

Practical workflows leveraging prompt-to-trace metrics:
Scenario: You improved a prompt and want to verify it actually performs better in production.
Workflow:
  1. Baseline: Current version (v2) in production with historical metrics
  2. Deploy new version: Create v3, deploy to production with production label
  3. Collect data: Run v3 for 24-48 hours to accumulate sufficient samples
  4. Compare metrics: Navigate to prompt in ABV dashboard, compare v2 vs. v3
    • Quality scores: Did median score improve?
    • Latency: Did response time change?
    • Costs: Did token usage increase or decrease?
  5. Decision:
    • If v3 improves quality without degrading latency/cost: Keep v3
    • If v3 degrades quality or increases cost too much: Roll back to v2
    • If results are mixed: Run longer A/B test for statistical significance
Benefits: Objective validation rather than subjective assessment, data-driven decisions.
Scenario: Quality metrics dropped after a recent deployment. You need to identify which prompt change caused the issue.
Workflow:
  1. Identify regression window: Check metrics dashboard to see when scores dropped
  2. Review prompt history: View prompt versions deployed during that time period
  3. Compare versions: Use diff view to see what changed between versions
  4. Correlate with metrics: Match deployment timestamps with metric changes
  5. Reproduce issue: Fetch the suspect version and test locally:
    suspect_version = abv.get_prompt("movie-critic", version=5)
    good_version = abv.get_prompt("movie-critic", version=4)
    # Compare outputs for the same inputs (see the fuller sketch after this workflow)
    
  6. Root cause analysis: Identify specific prompt change that caused regression
  7. Fix and redeploy: Create new version with fix, validate in staging, deploy
Benefits: Fast incident resolution, clear audit trail, reproducible debugging.
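Expanding step 5 into a runnable comparison, reusing the SDK and OpenAI calls already shown on this page (the test input is just an example):
from openai import OpenAI

client = OpenAI()

suspect_version = abv.get_prompt("movie-critic", version=5)
good_version = abv.get_prompt("movie-critic", version=4)

test_input = {"criticlevel": "expert", "movie": "Dune 2"}  # Same input for both versions

for label, candidate in [("v5 (suspect)", suspect_version), ("v4 (good)", good_version)]:
    compiled = candidate.compile(**test_input)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": compiled}],
    )
    print(f"--- {label} ---")
    print(response.choices[0].message.content)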
Scenario: You have two prompt variants and want to determine which performs better.
Workflow:
  1. Create variants:
    • v2: Variant A, assign variant-a label
    • v3: Variant B, assign variant-b label
  2. Implement randomization:
    import random
    variant = random.choice(["variant-a", "variant-b"])
    prompt = abv.get_prompt("movie-critic", label=variant)
    abv.update_current_generation(prompt=prompt)  # Link for tracking
    
  3. Collect data: Run for days/weeks to achieve statistical power
  4. Analyze results: Compare metrics by prompt version:
    • Quality: v2 median score 4.2/5, v3 median score 4.5/5
    • Latency: v2 median 450ms, v3 median 480ms (slightly slower)
    • Cost: v2 median $0.003, v3 median $0.004 (about 33% more expensive)
  5. Calculate significance: Use statistical tests to validate results (see the sketch after this workflow)
  6. Promote winner: Reassign production label to better variant
Learn more about A/B testing →
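For step 5, a two-sample test over per-generation scores gives a first-pass significance check; scipy is an extra dependency, and exporting the scores from ABV (e.g. via the dashboard) is assumed to happen beforehand:
from scipy import stats

# Example per-generation quality scores exported for each variant
scores_variant_a = [4.0, 4.5, 4.0, 4.5, 4.0, 4.5, 4.0, 4.0, 4.5, 4.0]
scores_variant_b = [4.5, 5.0, 4.5, 4.5, 5.0, 4.5, 4.0, 4.5, 5.0, 4.5]

# Welch's t-test: does not assume equal variances between variants
t_stat, p_value = stats.ttest_ind(scores_variant_a, scores_variant_b, equal_var=False)

if p_value < 0.05:
    print(f"Statistically significant difference (p={p_value:.4f})")
else:
    print(f"Not significant yet (p={p_value:.4f}); keep collecting data")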
Scenario: Set up continuous monitoring of production prompts to detect issues early.
Setup:
  1. Link all prompts to traces: Ensure all generation spans include prompt metadata (a helper sketch follows this list)
  2. Configure dashboards: Create custom dashboards showing:
    • Quality trends over time for production prompt version
    • Latency p95/p99 for production version
    • Cost per day for production version
    • Volume (generations/day) for production version
  3. Set up alerts: Configure alerts for:
    • Quality score drops below threshold
    • Latency increases above threshold
    • Cost per generation exceeds budget
    • Generation volume spikes or drops unexpectedly
  4. Regular review: Weekly review of prompt metrics to identify optimization opportunities
Benefits: Proactive issue detection, continuous optimization, cost control.
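For step 1, a small helper built from the SDK calls shown earlier keeps linking consistent across the codebase; the helper name and structure are just one way to organize it, and the LLM call itself is elided:
from abvdev import ABV, observe

abv = ABV(api_key="sk-abv-...", host="https://app.abv.dev")

@observe(as_type="generation")
def generate_with_linked_prompt(prompt_name: str, **variables):
    # Fetch the production version and link it to the current generation span
    prompt = abv.get_prompt(prompt_name)
    abv.update_current_generation(prompt=prompt)

    compiled_prompt = prompt.compile(**variables)

    # ... make the LLM call with compiled_prompt and update the generation output ...
    return compiled_prompt  # Placeholder return for the sketch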

Metrics Reference

Complete list of metrics tracked when prompts are linked to traces:
| Metric | Description | Aggregation |
| --- | --- | --- |
| Median generation latency | Median time from generation start to end | Median across all generations |
| Median input tokens | Median count of input tokens | Median across all generations |
| Median output tokens | Median count of output tokens | Median across all generations |
| Median generation cost | Median cost per generation (input + output tokens) | Median across all generations |
| Generation count | Total number of generations using this prompt version | Sum |
| Median score value | Median score across all score types (user, model, human) | Median across all scores |
| First generation timestamp | When this prompt version was first used | Earliest timestamp |
| Last generation timestamp | When this prompt version was most recently used | Latest timestamp |
| Quality trend | Change in median score over time | Time series |
| Cost trend | Change in median cost over time | Time series |
| Latency trend | Change in median latency over time | Time series |

Next Steps