A/B testing (also called split testing) enables comparing two or more prompt versions in production with real users and use cases. Rather than choosing between prompts based on intuition or small-scale testing, A/B testing provides statistical evidence about which prompt performs better under real-world conditions.

When to Use A/B Testing

A/B testing is powerful but not appropriate for every situation.
A/B testing works well for:
Consumer applications with high volume:
  • Applications with thousands of daily users (sufficient sample size)
  • Use cases where small quality variations are acceptable
  • Scenarios where you can collect quality signals (user feedback, automated scores)
Canary deployments:
  • You’ve validated improvements on test datasets
  • You want to verify production performance before full rollout
  • You can monitor metrics in real-time to catch issues early
Optimization iterations:
  • Incremental prompt improvements where directional changes are clear
  • Testing hypotheses about what drives quality (tone, length, structure)
  • Comparing prompts with similar expected performance
Examples: chatbot greeting messages, content summarization, code completion suggestions, product recommendations.
A/B testing is a poor fit for:
Mission-critical applications:
  • Healthcare decisions (potential patient harm)
  • Financial transactions (regulatory requirements)
  • Legal advice (liability concerns)
  • Safety-critical systems (autonomous vehicles, industrial controls)
Low-volume applications:
  • Fewer than 100 daily users (insufficient statistical power)
  • Use cases with long feedback cycles (weeks between samples)
  • Scenarios where each request is unique (no aggregate patterns)
High-stakes accuracy requirements:
  • Applications where any error is unacceptable
  • Regulated industries with strict compliance requirements
  • Use cases requiring deterministic outputs
Alternative: For these scenarios, use comprehensive offline evaluation on datasets before deploying to production, then monitor with 100% production traffic rather than split testing.
Before starting A/B testing, ensure you have:
  1. Measurable success metrics: Quality scores, user feedback, task completion rates, or business outcomes
  2. Sufficient traffic volume: At least 100-200 samples per variant for statistical significance
  3. Prompt linking infrastructure: Ability to link prompts to traces for metric aggregation
  4. Monitoring dashboards: Real-time visibility into quality metrics by prompt version
  5. Rollback capability: Ability to stop the test and revert if issues arise
  6. Statistical analysis skills: Understanding of significance testing, confidence intervals, and statistical power
Without these prerequisites, A/B testing becomes guesswork rather than scientific experimentation.

How A/B Testing Works

The complete A/B testing lifecycle, from setup to decision: create variants → implement random assignment → collect sufficient data → analyze for statistical significance → make a deployment decision → monitor results.

Create prompt variants and assign labels

Create two (or more) prompt versions with different content, structure, or parameters.
Via ABV UI:
  1. Navigate to your prompt in the ABV dashboard
  2. Create a new version with variant A content
  3. Assign label variant-a (or prod-a)
  4. Create another version with variant B content
  5. Assign label variant-b (or prod-b)
Via SDK:
# Create variant A
abv.create_prompt(
    name="movie-critic",
    prompt="As a {{criticlevel}} movie critic, provide a detailed review of {{movie}}.",
    labels=["variant-a"],
    config={"temperature": 0.7}
)

# Create variant B
abv.create_prompt(
    name="movie-critic",
    prompt="You're a {{criticlevel}} film critic. Share your thoughts on {{movie}}.",
    labels=["variant-b"],
    config={"temperature": 0.8}
)
Version numbers: ABV automatically assigns incremental version numbers (e.g., versions 3 and 4), but you’ll reference them by label in your code.

Implement randomized assignment in application code

Modify your application to randomly select between variants for each request.
Python implementation:
from abvdev import ABV
from openai import OpenAI
import random

abv = ABV(api_key="sk-abv-...", host="https://app.abv.dev")
openai_client = OpenAI(api_key="sk-proj-...")

# Fetch both variants
prompt_a = abv.get_prompt("movie-critic", label="variant-a")
prompt_b = abv.get_prompt("movie-critic", label="variant-b")

# Randomly select variant (50/50 split)
selected_prompt = random.choice([prompt_a, prompt_b])

# Compile and use
compiled_prompt = selected_prompt.compile(
    criticlevel="expert",
    movie="Dune 2"
)

# Link prompt to trace for metric tracking
with abv.start_as_current_observation(
    as_type="generation",
    name="movie-review",
    prompt=selected_prompt  # Crucial: link for metrics
) as generation:
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": compiled_prompt}]
    )
    generation.update(output=response.choices[0].message.content)

abv.flush()  # For short-lived applications
TypeScript/JavaScript implementation:
import { ABVClient } from "@abvdev/client";
import { startObservation } from "@abvdev/tracing";
import OpenAI from "openai";

const abv = new ABVClient();
const openai = new OpenAI();

async function main() {
  // Fetch both variants
  const promptA = await abv.prompt.get("movie-critic", {
    label: "variant-a",
  });

  const promptB = await abv.prompt.get("movie-critic", {
    label: "variant-b",
  });

  // Randomly select variant (50/50 split)
  const selectedPrompt = Math.random() < 0.5 ? promptA : promptB;

  // Compile the text prompt once and reuse it
  const compiledPrompt = selectedPrompt.compile({
    criticlevel: "expert",
    movie: "Dune 2",
  });

  // Create generation span with linked prompt
  const generation = startObservation(
    "movie-review",
    {
      model: "gpt-4o",
      input: compiledPrompt,
      prompt: selectedPrompt  // Link for metrics
    },
    { asType: "generation" }
  );

  // Pass the compiled text prompt as a user message
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: compiledPrompt }],
  });

  generation.update({
    output: { content: completion.choices[0].message.content },
  });

  generation.end();
}

main();
Traffic split ratios: Use 50/50 for equal comparison, or adjust ratios (e.g., 90/10 for cautious canary deployment).
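Sticky assignment: If each user should see the same variant across requests (rather than re-randomizing on every call), derive the assignment from a stable user identifier. A minimal sketch in Python; the hashing scheme and the user_id value are illustrative and not part of the ABV SDK:
import hashlib

def assign_variant(user_id: str, split: float = 0.5) -> str:
    """Deterministically map a user ID to a variant label (sticky assignment)."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform value in [0, 1]
    return "variant-a" if bucket < split else "variant-b"

# Fetch the prompt by the assigned label; the same user always gets the same variant
label = assign_variant("user-123")
selected_prompt = abv.get_prompt("movie-critic", label=label)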

Collect data over sufficient time period

Run the A/B test until you’ve collected enough data for statistical significance.
Minimum sample size:
  • At least 100-200 generations per variant
  • More samples for smaller expected differences
  • Use online sample size calculators for precise requirements
Time period:
  • Run for multiple days to account for day-of-week effects
  • Include weekdays and weekends if usage patterns differ
  • Ensure you capture diverse user segments and use cases
Monitor during collection:
  • Watch dashboards for unexpected issues
  • Check that traffic is splitting as expected (a quick check is sketched at the end of this step)
  • Verify metrics are being collected for both variants
Early stopping criteria: Stop the test early if:
  • One variant shows severe quality degradation
  • Error rates spike for one variant
  • Statistical significance is achieved with clear winner
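Traffic split check: A quick way to verify that traffic is splitting as expected is a binomial test on the per-variant generation counts. A minimal sketch with illustrative counts, assuming a 50/50 split was intended:
from scipy import stats

# Generations observed per variant so far (illustrative counts)
n_a, n_b = 640, 560

# Binomial test: is the observed split consistent with the intended 50/50?
result = stats.binomtest(n_a, n_a + n_b, p=0.5)
print(f"p-value: {result.pvalue:.3f}")
if result.pvalue < 0.05:
    print("Traffic split deviates from 50/50 -- check the assignment logic")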

Analyze results and calculate significance

Navigate to the prompt in the ABV dashboard and compare metrics by version.
Key metrics to compare:
  • Quality scores: Median score, score distribution by variant
  • Latency: Median, p95, p99 response times
  • Token usage: Input tokens, output tokens (affects cost)
  • Cost: Median cost per generation
  • User feedback: Thumbs up/down ratios, satisfaction ratings
Statistical significance:
  • Use significance tests (t-test, Mann-Whitney U test) to determine if differences are real
  • Calculate confidence intervals (95% CI recommended)
  • Consider practical significance: Is the improvement meaningful even if statistically significant?
Example analysis:
Variant A:
- Median quality score: 4.2/5
- Median latency: 450ms
- Median cost: $0.003
- Samples: 1,250

Variant B:
- Median quality score: 4.5/5 (7% improvement)
- Median latency: 480ms (6% slower)
- Median cost: $0.004 (33% more expensive)
- Samples: 1,238

Statistical significance: p < 0.05 (quality improvement is significant)
Decision: Variant B improves quality but at higher cost. Evaluate tradeoff.
Tools for analysis: Use Python (scipy, statsmodels), R, or online calculators for significance testing.
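Confidence interval sketch: To attach a 95% confidence interval to the difference in medians without distributional assumptions, a bootstrap works well. A minimal sketch, assuming scores_a and scores_b hold the per-generation quality scores exported for each variant:
import numpy as np

def median_diff_ci(scores_a, scores_b, n_boot=10_000, seed=0):
    """Bootstrap a 95% CI for the difference in median quality score (B - A)."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    diffs = [
        np.median(rng.choice(b, size=b.size, replace=True))
        - np.median(rng.choice(a, size=a.size, replace=True))
        for _ in range(n_boot)
    ]
    return np.percentile(diffs, [2.5, 97.5])

# low, high = median_diff_ci(scores_a, scores_b)
# An interval that excludes 0 suggests the median improvement is not just noise.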

Make decision and deploy winner

Based on the analysis, choose the winning variant.
Clear winner:
  • Variant significantly better on primary metric (quality)
  • No significant degradation on secondary metrics (cost, latency)
  • Action: Promote winner to production by reassigning production label
Mixed results:
  • Variant better on quality but worse on cost
  • Small improvement with high uncertainty
  • Action: Evaluate tradeoffs, possibly run longer test, or choose based on business priorities
No significant difference:
  • Variants perform similarly across all metrics
  • Action: Keep existing version (simpler) or choose based on maintenance/cost
Deployment:
# After deciding variant-b is the winner, promote via UI or SDK:
abv.update_prompt(
    name="movie-critic",
    version=4,  # variant-b version number
    new_labels=["production"]  # Assign production label
)
Post-deployment monitoring: Continue monitoring quality after full rollout to ensure results hold at 100% traffic.

Implementation Examples

Complete examples for both SDKs:
Complete A/B testing implementation (Python):
from abvdev import ABV
from openai import OpenAI
import random
import os

# Initialize clients
abv = ABV(
    api_key=os.getenv("ABV_API_KEY"),
    host="https://app.abv.dev",
)
openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def run_ab_test(user_input: dict):
    """
    Run A/B test for movie critic prompt.

    Args:
        user_input: Dict with 'criticlevel' and 'movie' keys

    Returns:
        LLM response
    """
    # Fetch both variants
    prompt_a = abv.get_prompt("movie-critic", label="variant-a")
    prompt_b = abv.get_prompt("movie-critic", label="variant-b")

    # Randomly assign user to variant (50/50 split)
    selected_prompt = random.choice([prompt_a, prompt_b])

    # Compile prompt with user input
    compiled_prompt = selected_prompt.compile(
        criticlevel=user_input["criticlevel"],
        movie=user_input["movie"]
    )

    # Create generation with linked prompt
    with abv.start_as_current_observation(
        as_type="generation",
        name="movie-review-ab-test",
        prompt=selected_prompt  # Link for tracking by version
    ) as generation:
        # Call LLM
        response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": compiled_prompt}]
        )

        result = response.choices[0].message.content

        # Update generation with output
        generation.update(output=result)

        return result

# Usage
if __name__ == "__main__":
    result = run_ab_test({
        "criticlevel": "expert",
        "movie": "The Lord of the Rings"
    })
    print(result)

    # Flush events for short-lived applications
    abv.flush()
Weighted traffic split (90% control, 10% variant):
# Weighted random selection
selected_prompt = random.choices(
    [prompt_a, prompt_b],
    weights=[0.9, 0.1],  # 90% variant-a, 10% variant-b
    k=1
)[0]
Complete A/B testing implementation (TypeScript/JavaScript).
Setup (instrumentation.ts):
import dotenv from "dotenv";
dotenv.config();

import { NodeSDK } from "@opentelemetry/sdk-node";
import { ABVSpanProcessor } from "@abvdev/otel";

const sdk = new NodeSDK({
  spanProcessors: [
    new ABVSpanProcessor({
      apiKey: process.env.ABV_API_KEY,
      baseUrl: process.env.ABV_BASE_URL,
      exportMode: "immediate",
      flushAt: 1,
      flushInterval: 1,
    })
  ],
});

sdk.start();
A/B test implementation (index.ts):
import "./instrumentation"; // Must be first import
import { ABVClient } from "@abvdev/client";
import { startObservation } from "@abvdev/tracing";
import OpenAI from "openai";
import dotenv from "dotenv";
dotenv.config();

const abv = new ABVClient();
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

async function runABTest(userInput: {
  criticlevel: string;
  movie: string;
}) {
  // Fetch both variants
  const promptA = await abv.prompt.get("movie-critic", {
    label: "variant-a",
  });

  const promptB = await abv.prompt.get("movie-critic", {
    label: "variant-b",
  });

  // Randomly assign user to variant (50/50 split)
  const selectedPrompt = Math.random() < 0.5 ? promptA : promptB;

  // Compile the text prompt with user input
  const compiledPrompt = selectedPrompt.compile(userInput);

  // Create generation with linked prompt
  const generation = startObservation(
    "movie-review-ab-test",
    {
      model: "gpt-4o",
      input: compiledPrompt,
      prompt: selectedPrompt  // Link for tracking
    },
    { asType: "generation" }
  );

  // Call LLM with the compiled prompt as a user message
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: compiledPrompt }],
  });

  const result = completion.choices[0].message.content;

  // Update generation
  generation.update({
    output: { content: result },
  });

  generation.end();

  return result;
}

// Usage
async function main() {
  const result = await runABTest({
    criticlevel: "expert",
    movie: "The Lord of the Rings"
  });
  console.log(result);
}

main();
Weighted traffic split (90% control, 10% variant):
const selectedPrompt = Math.random() < 0.9 ? promptA : promptB;
// 90% get promptA, 10% get promptB

Statistical Analysis

Understanding statistical concepts for A/B testing:
P-value: The probability of observing a difference at least this large if there were no true difference between the variants.
Interpretation:
  • p < 0.05: Less than 5% chance results are due to randomness (commonly used threshold)
  • p < 0.01: Less than 1% chance (stronger evidence)
  • p > 0.05: Difference not statistically significant (could be random)
Example:
  • Variant A: median score 4.2
  • Variant B: median score 4.5
  • p-value: 0.03
  • Conclusion: The 0.3 point improvement is statistically significant (p < 0.05)
Caution: Significance doesn’t guarantee practical importance. Always consider effect size.
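Effect size sketch: A common effect-size measure for score comparisons is Cohen’s d, the standardized mean difference. A minimal sketch, assuming scores_a and scores_b are the per-variant quality scores and the samples are roughly equal in size:
import numpy as np

def cohens_d(scores_a, scores_b):
    """Standardized mean difference between two score samples (effect size)."""
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    pooled_std = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)  # simple pooled std
    return (b.mean() - a.mean()) / pooled_std

# Rough guideline: d around 0.2 is small, 0.5 medium, 0.8 large.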
Confidence interval: The range in which the true value likely falls.
Interpretation:
  • 95% CI: We’re 95% confident the true value is in this range
  • Wider intervals indicate more uncertainty
  • Non-overlapping intervals suggest significant difference
Example:
  • Variant A: median score 4.2, 95% CI [4.0, 4.4]
  • Variant B: median score 4.5, 95% CI [4.3, 4.7]
  • Conclusion: Intervals don’t overlap—variant B is likely better
Use: Provides intuition about uncertainty in results, complements p-values.
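Binary-metric example: For binary feedback such as thumbs up/down, you can compute a confidence interval on each variant’s thumbs-up rate directly. A minimal sketch using statsmodels, with illustrative counts:
from statsmodels.stats.proportion import proportion_confint

# Thumbs-up count and total feedback events per variant (illustrative numbers)
low_a, high_a = proportion_confint(count=530, nobs=650, alpha=0.05, method="wilson")
low_b, high_b = proportion_confint(count=565, nobs=660, alpha=0.05, method="wilson")

print(f"Variant A thumbs-up rate 95% CI: [{low_a:.2f}, {high_a:.2f}]")
print(f"Variant B thumbs-up rate 95% CI: [{low_b:.2f}, {high_b:.2f}]")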
Statistical power: Probability of detecting a real difference if it exists.
Factors affecting required sample size:
  • Effect size: Smaller differences need more samples
  • Baseline variance: Higher variance needs more samples
  • Desired power: Higher power (80-90% recommended) needs more samples
  • Significance level: Stricter thresholds (p < 0.01) need more samples
Example calculation (simplified):
  • Baseline score: 4.0 (std dev 1.0)
  • Expected improvement: 10% (0.4 points)
  • Desired power: 80%
  • Significance: 0.05
  • Required samples: ~100 per variant
Tools: Use online calculators (Evan’s Awesome A/B Tools, Optimizely Sample Size Calculator) for precise calculations.
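Sample size sketch: You can also estimate the required sample size in Python with statsmodels. A minimal sketch reproducing the example above (0.4-point expected improvement against a standard deviation of 1.0):
from statsmodels.stats.power import TTestIndPower

effect_size = 0.4 / 1.0  # Cohen's d = expected difference / standard deviation

n_per_variant = TTestIndPower().solve_power(
    effect_size=effect_size,
    power=0.8,        # 80% chance of detecting a real effect of this size
    alpha=0.05,       # two-sided significance threshold
    alternative="two-sided",
)
print(f"Required samples per variant: {n_per_variant:.0f}")  # roughly 100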
For continuous metrics (quality scores, latency):
  • t-test: Compares means, assumes normal distribution
  • Mann-Whitney U test: Compares medians, no distribution assumption (recommended for scores)
For binary metrics (thumbs up/down, success/failure):
  • Chi-square test: Compares proportions
  • Fisher’s exact test: For small sample sizes
For count data (errors, conversions):
  • Poisson test: Compares event rates
Python example (Mann-Whitney U test):
from scipy import stats

variant_a_scores = [4.2, 4.0, 4.5, 4.1, ...]  # 400 scores
variant_b_scores = [4.5, 4.3, 4.7, 4.4, ...]  # 400 scores

statistic, p_value = stats.mannwhitneyu(
    variant_a_scores,
    variant_b_scores,
    alternative='two-sided'
)

print(f"p-value: {p_value}")
if p_value < 0.05:
    print("Statistically significant difference")
else:
    print("No significant difference")

Common Pitfalls to Avoid

Problem: Declaring a winner after 50 samples because variant B looks better.
Why it’s wrong: Small samples have high variance. Early results often don’t hold with more data.
Solution: Pre-commit to minimum sample size (100-200+ per variant) before looking at results. Use sequential testing methods if you must peek early.
Problem: Running multiple tests on the same data until you find statistical significance.
Example: Testing 20 different metrics, finding that 1 is significant at p < 0.05 (expected by chance).
Solution: Pre-register your primary metric before starting the test. Treat secondary metrics as exploratory only.
Problem: Deploying a variant because it’s statistically better, even though the improvement is tiny.
Example: p < 0.01 but quality improves only 0.5% while cost increases 30%.
Solution: Set minimum thresholds for practical significance before the test. Consider cost-benefit tradeoffs.
Problem: Implementing the A/B test but forgetting to link prompts to generation spans.
Result: ABV can’t aggregate metrics by prompt version, so you have no way to compare variants.
Solution: Always pass prompt=selected_prompt when creating generation spans:
with abv.start_as_current_observation(
    as_type="generation",
    prompt=selected_prompt  # Don't forget this!
) as generation:
    ...
Problem: Running variant A during weekdays and variant B during weekends, then concluding B is better.
Why it’s wrong: Weekend traffic might differ from weekday traffic. You can’t tell if the difference is due to the prompt or the day of week.
Solution: Run variants concurrently with randomized assignment to ensure comparable populations.

Next Steps