A/B testing (also called split testing) enables comparing two or more prompt versions in production with real users and use cases. Rather than choosing between prompts based on intuition or small-scale testing, A/B testing provides statistical evidence about which prompt performs better under real-world conditions.

When to Use A/B Testing

A/B testing is powerful but not appropriate for every situation.
A/B testing works well for:
Consumer applications with high volume:
  • Applications with thousands of daily users (sufficient sample size)
  • Use cases where small quality variations are acceptable
  • Scenarios where you can collect quality signals (user feedback, automated scores)
Canary deployments:
  • You’ve validated improvements on test datasets
  • You want to verify production performance before full rollout
  • You can monitor metrics in real-time to catch issues early
Optimization iterations:
  • Incremental prompt improvements where directional changes are clear
  • Testing hypotheses about what drives quality (tone, length, structure)
  • Comparing prompts with similar expected performance
Examples: chatbot greeting messages, content summarization, code completion suggestions, product recommendations.
A/B testing is a poor fit for:
Mission-critical applications:
  • Healthcare decisions (potential patient harm)
  • Financial transactions (regulatory requirements)
  • Legal advice (liability concerns)
  • Safety-critical systems (autonomous vehicles, industrial controls)
Low-volume applications:
  • Fewer than 100 daily users (insufficient statistical power)
  • Use cases with long feedback cycles (weeks between samples)
  • Scenarios where each request is unique (no aggregate patterns)
High-stakes accuracy requirements:
  • Applications where any error is unacceptable
  • Regulated industries with strict compliance requirements
  • Use cases requiring deterministic outputs
Alternative: For these scenarios, use comprehensive offline evaluation on datasets before deploying to production, then monitor with 100% production traffic rather than split testing.
Before starting A/B testing, ensure you have:
  1. Measurable success metrics: Quality scores, user feedback, task completion rates, or business outcomes
  2. Sufficient traffic volume: At least 100-200 samples per variant for statistical significance
  3. Prompt linking infrastructure: Ability to link prompts to traces for metric aggregation
  4. Monitoring dashboards: Real-time visibility into quality metrics by prompt version
  5. Rollback capability: Ability to stop the test and revert if issues arise
  6. Statistical analysis skills: Understanding of significance testing, confidence intervals, and statistical power
Without these prerequisites, A/B testing becomes guesswork rather than scientific experimentation.

How A/B Testing Works

The complete A/B testing lifecycle, from setup to decision: create variants → implement random assignment → collect sufficient data → analyze for statistical significance → make a deployment decision → monitor results.

Create prompt variants and assign labels

Create two (or more) prompt versions with different content, structure, or parameters.
Via ABV UI:
  1. Navigate to your prompt in the ABV dashboard
  2. Create a new version with variant A content
  3. Assign label variant-a (or prod-a)
  4. Create another version with variant B content
  5. Assign label variant-b (or prod-b)
Via SDK:
# Create variant A
abv.create_prompt(
    name="movie-critic",
    prompt="As a {{criticlevel}} movie critic, provide a detailed review of {{movie}}.",
    labels=["variant-a"],
    config={"temperature": 0.7}
)

# Create variant B
abv.create_prompt(
    name="movie-critic",
    prompt="You're a {{criticlevel}} film critic. Share your thoughts on {{movie}}.",
    labels=["variant-b"],
    config={"temperature": 0.8}
)
Version numbers: ABV automatically assigns incremental version numbers (e.g., versions 3 and 4), but you’ll reference them by label in your code.

Implement randomized assignment in application code

Modify your application to randomly select between variants for each request.
Python implementation:
from abvdev import ABV
from openai import OpenAI
import random

abv = ABV(api_key="sk-abv-...", host="https://app.abv.dev")
openai_client = OpenAI(api_key="sk-proj-...")

# Fetch both variants
prompt_a = abv.get_prompt("movie-critic", label="variant-a")
prompt_b = abv.get_prompt("movie-critic", label="variant-b")

# Randomly select variant (50/50 split)
selected_prompt = random.choice([prompt_a, prompt_b])

# Compile and use
compiled_prompt = selected_prompt.compile(
    criticlevel="expert",
    movie="Dune 2"
)

# Link prompt to trace for metric tracking
with abv.start_as_current_observation(
    as_type="generation",
    name="movie-review",
    prompt=selected_prompt  # Crucial: link for metrics
) as generation:
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": compiled_prompt}]
    )
    generation.update(output=response.choices[0].message.content)

abv.flush()  # For short-lived applications
TypeScript/JavaScript implementation:
import { ABVClient } from "@abvdev/client";
import { startObservation } from "@abvdev/tracing";
import OpenAI from "openai";

const abv = new ABVClient();
const openai = new OpenAI();

async function main() {
  // Fetch both variants
  const promptA = await abv.prompt.get("movie-critic", {
    label: "variant-a",
  });

  const promptB = await abv.prompt.get("movie-critic", {
    label: "variant-b",
  });

  // Randomly select variant (50/50 split)
  const selectedPrompt = Math.random() < 0.5 ? promptA : promptB;

  // Compile the text prompt once and reuse it
  const compiledPrompt = selectedPrompt.compile({
    criticlevel: "expert",
    movie: "Dune 2",
  });

  // Create generation span with linked prompt
  const generation = startObservation(
    "movie-review",
    {
      model: "gpt-4o",
      input: compiledPrompt,
      prompt: selectedPrompt  // Link for metrics
    },
    { asType: "generation" }
  );

  // Pass the compiled text prompt as a user message
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: compiledPrompt }],
  });

  generation.update({
    output: { content: completion.choices[0].message.content },
  });

  generation.end();
}

main();
Traffic split ratios: Use 50/50 for equal comparison, or adjust ratios (e.g., 90/10 for cautious canary deployment).
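Sticky assignment: If each user should see the same variant across requests (rather than re-randomizing on every call), derive the assignment from a stable user identifier. A minimal sketch in Python; the hashing scheme and the user_id value are illustrative and not part of the ABV SDK:
import hashlib

def assign_variant(user_id: str, split: float = 0.5) -> str:
    """Deterministically map a user ID to a variant label (sticky assignment)."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform value in [0, 1]
    return "variant-a" if bucket < split else "variant-b"

# Fetch the prompt by the assigned label; the same user always gets the same variant
label = assign_variant("user-123")
selected_prompt = abv.get_prompt("movie-critic", label=label)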

Collect data over sufficient time period

Run the A/B test until you’ve collected enough data for statistical significance.
Minimum sample size:
  • At least 100-200 generations per variant
  • More samples for smaller expected differences
  • Use online sample size calculators for precise requirements
Time period:
  • Run for multiple days to account for day-of-week effects
  • Include weekdays and weekends if usage patterns differ
  • Ensure you capture diverse user segments and use cases
Monitor during collection:
  • Watch dashboards for unexpected issues
  • Check that traffic is splitting as expected (a quick check is sketched at the end of this step)
  • Verify metrics are being collected for both variants
Early stopping criteria: Stop the test early if:
  • One variant shows severe quality degradation
  • Error rates spike for one variant
  • Statistical significance is achieved with clear winner
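Traffic split check: A quick way to verify that traffic is splitting as expected is a binomial test on the per-variant generation counts. A minimal sketch with illustrative counts, assuming a 50/50 split was intended:
from scipy import stats

# Generations observed per variant so far (illustrative counts)
n_a, n_b = 640, 560

# Binomial test: is the observed split consistent with the intended 50/50?
result = stats.binomtest(n_a, n_a + n_b, p=0.5)
print(f"p-value: {result.pvalue:.3f}")
if result.pvalue < 0.05:
    print("Traffic split deviates from 50/50 -- check the assignment logic")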

Analyze results and calculate significance

Navigate to the prompt in the ABV dashboard and compare metrics by version.
Key metrics to compare:
  • Quality scores: Median score, score distribution by variant
  • Latency: Median, p95, p99 response times
  • Token usage: Input tokens, output tokens (affects cost)
  • Cost: Median cost per generation
  • User feedback: Thumbs up/down ratios, satisfaction ratings
Statistical significance:
  • Use significance tests (t-test, Mann-Whitney U test) to determine if differences are real
  • Calculate confidence intervals (95% CI recommended)
  • Consider practical significance: Is the improvement meaningful even if statistically significant?
Example analysis:
Variant A:
- Median quality score: 4.2/5
- Median latency: 450ms
- Median cost: $0.003
- Samples: 1,250

Variant B:
- Median quality score: 4.5/5 (7% improvement)
- Median latency: 480ms (6% slower)
- Median cost: $0.004 (33% more expensive)
- Samples: 1,238

Statistical significance: p < 0.05 (quality improvement is significant)
Decision: Variant B improves quality but at higher cost. Evaluate tradeoff.
Tools for analysis: Use Python (scipy, statsmodels), R, or online calculators for significance testing.
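Confidence interval sketch: To attach a 95% confidence interval to the difference in medians without distributional assumptions, a bootstrap works well. A minimal sketch, assuming scores_a and scores_b hold the per-generation quality scores exported for each variant:
import numpy as np

def median_diff_ci(scores_a, scores_b, n_boot=10_000, seed=0):
    """Bootstrap a 95% CI for the difference in median quality score (B - A)."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    diffs = [
        np.median(rng.choice(b, size=b.size, replace=True))
        - np.median(rng.choice(a, size=a.size, replace=True))
        for _ in range(n_boot)
    ]
    return np.percentile(diffs, [2.5, 97.5])

# low, high = median_diff_ci(scores_a, scores_b)
# An interval that excludes 0 suggests the median improvement is not just noise.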

Make decision and deploy winner

Based on the analysis, choose the winning variant.
Clear winner:
  • Variant significantly better on primary metric (quality)
  • No significant degradation on secondary metrics (cost, latency)
  • Action: Promote winner to production by reassigning production label
Mixed results:
  • Variant better on quality but worse on cost
  • Small improvement with high uncertainty
  • Action: Evaluate tradeoffs, possibly run longer test, or choose based on business priorities
No significant difference:
  • Variants perform similarly across all metrics
  • Action: Keep existing version (simpler) or choose based on maintenance/cost
Deployment:
# After deciding variant-b is the winner, promote via UI or SDK:
abv.update_prompt(
    name="movie-critic",
    version=4,  # variant-b version number
    new_labels=["production"]  # Assign production label
)
Post-deployment monitoring: Continue monitoring quality after full rollout to ensure results hold at 100% traffic.

Implementation Examples

Complete examples for both SDKs:
Complete A/B testing implementation (Python):
from abvdev import ABV
from openai import OpenAI
import random
import os

# Initialize clients
abv = ABV(
    api_key=os.getenv("ABV_API_KEY"),
    host="https://app.abv.dev",
)
openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def run_ab_test(user_input: dict):
    """
    Run A/B test for movie critic prompt.

    Args:
        user_input: Dict with 'criticlevel' and 'movie' keys

    Returns:
        LLM response
    """
    # Fetch both variants
    prompt_a = abv.get_prompt("movie-critic", label="variant-a")
    prompt_b = abv.get_prompt("movie-critic", label="variant-b")

    # Randomly assign user to variant (50/50 split)
    selected_prompt = random.choice([prompt_a, prompt_b])

    # Compile prompt with user input
    compiled_prompt = selected_prompt.compile(
        criticlevel=user_input["criticlevel"],
        movie=user_input["movie"]
    )

    # Create generation with linked prompt
    with abv.start_as_current_observation(
        as_type="generation",
        name="movie-review-ab-test",
        prompt=selected_prompt  # Link for tracking by version
    ) as generation:
        # Call LLM
        response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": compiled_prompt}]
        )

        result = response.choices[0].message.content

        # Update generation with output
        generation.update(output=result)

        return result

# Usage
if __name__ == "__main__":
    result = run_ab_test({
        "criticlevel": "expert",
        "movie": "The Lord of the Rings"
    })
    print(result)

    # Flush events for short-lived applications
    abv.flush()
Weighted traffic split (90% control, 10% variant):
# Weighted random selection
selected_prompt = random.choices(
    [prompt_a, prompt_b],
    weights=[0.9, 0.1],  # 90% variant-a, 10% variant-b
    k=1
)[0]
Complete A/B testing implementation (TypeScript/JavaScript).
Setup (instrumentation.ts):
import dotenv from "dotenv";
dotenv.config();

import { NodeSDK } from "@opentelemetry/sdk-node";
import { ABVSpanProcessor } from "@abvdev/otel";

const sdk = new NodeSDK({
  spanProcessors: [
    new ABVSpanProcessor({
      apiKey: process.env.ABV_API_KEY,
      baseUrl: process.env.ABV_BASE_URL,
      exportMode: "immediate",
      flushAt: 1,
      flushInterval: 1,
    })
  ],
});

sdk.start();
A/B test implementation (index.ts):
import "./instrumentation"; // Must be first import
import { ABVClient } from "@abvdev/client";
import { startObservation } from "@abvdev/tracing";
import OpenAI from "openai";
import dotenv from "dotenv";
dotenv.config();

const abv = new ABVClient();
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

async function runABTest(userInput: {
  criticlevel: string;
  movie: string;
}) {
  // Fetch both variants
  const promptA = await abv.prompt.get("movie-critic", {
    label: "variant-a",
  });

  const promptB = await abv.prompt.get("movie-critic", {
    label: "variant-b",
  });

  // Randomly assign user to variant (50/50 split)
  const selectedPrompt = Math.random() < 0.5 ? promptA : promptB;

  // Compile the text prompt with user input
  const compiledPrompt = selectedPrompt.compile(userInput);

  // Create generation with linked prompt
  const generation = startObservation(
    "movie-review-ab-test",
    {
      model: "gpt-4o",
      input: compiledPrompt,
      prompt: selectedPrompt  // Link for tracking
    },
    { asType: "generation" }
  );

  // Call LLM with the compiled prompt as a user message
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: compiledPrompt }],
  });

  const result = completion.choices[0].message.content;

  // Update generation
  generation.update({
    output: { content: result },
  });

  generation.end();

  return result;
}

// Usage
async function main() {
  const result = await runABTest({
    criticlevel: "expert",
    movie: "The Lord of the Rings"
  });
  console.log(result);
}

main();
Weighted traffic split (90% control, 10% variant):
const selectedPrompt = Math.random() < 0.9 ? promptA : promptB;
// 90% get promptA, 10% get promptB

Statistical Analysis

Understanding statistical concepts for A/B testing:
P-value: The probability of observing a difference at least this large if there were no true difference between the variants.
Interpretation:
  • p < 0.05: Less than 5% chance results are due to randomness (commonly used threshold)
  • p < 0.01: Less than 1% chance (stronger evidence)
  • p > 0.05: Difference not statistically significant (could be random)
Example:
  • Variant A: median score 4.2
  • Variant B: median score 4.5
  • p-value: 0.03
  • Conclusion: The 0.3 point improvement is statistically significant (p < 0.05)
Caution: Significance doesn’t guarantee practical importance. Always consider effect size.
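Effect size sketch: A common effect-size measure for score comparisons is Cohen’s d, the standardized mean difference. A minimal sketch, assuming scores_a and scores_b are the per-variant quality scores and the samples are roughly equal in size:
import numpy as np

def cohens_d(scores_a, scores_b):
    """Standardized mean difference between two score samples (effect size)."""
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    pooled_std = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)  # simple pooled std
    return (b.mean() - a.mean()) / pooled_std

# Rough guideline: d around 0.2 is small, 0.5 medium, 0.8 large.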
Confidence interval: The range in which the true value likely falls.
Interpretation:
  • 95% CI: We’re 95% confident the true value is in this range
  • Wider intervals indicate more uncertainty
  • Non-overlapping intervals suggest significant difference
Example:
  • Variant A: median score 4.2, 95% CI [4.0, 4.4]
  • Variant B: median score 4.5, 95% CI [4.3, 4.7]
  • Conclusion: Intervals don’t overlap—variant B is likely better
Use: Provides intuition about uncertainty in results, complements p-values.
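Binary-metric example: For binary feedback such as thumbs up/down, you can compute a confidence interval on each variant’s thumbs-up rate directly. A minimal sketch using statsmodels, with illustrative counts:
from statsmodels.stats.proportion import proportion_confint

# Thumbs-up count and total feedback events per variant (illustrative numbers)
low_a, high_a = proportion_confint(count=530, nobs=650, alpha=0.05, method="wilson")
low_b, high_b = proportion_confint(count=565, nobs=660, alpha=0.05, method="wilson")

print(f"Variant A thumbs-up rate 95% CI: [{low_a:.2f}, {high_a:.2f}]")
print(f"Variant B thumbs-up rate 95% CI: [{low_b:.2f}, {high_b:.2f}]")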
Statistical power: Probability of detecting a real difference if it exists.
Factors affecting required sample size:
  • Effect size: Smaller differences need more samples
  • Baseline variance: Higher variance needs more samples
  • Desired power: Higher power (80-90% recommended) needs more samples
  • Significance level: Stricter thresholds (p < 0.01) need more samples
Example calculation (simplified):
  • Baseline score: 4.0 (std dev 1.0)
  • Expected improvement: 10% (0.4 points)
  • Desired power: 80%
  • Significance: 0.05
  • Required samples: ~100 per variant
Tools: Use online calculators (Evan’s Awesome A/B Tools, Optimizely Sample Size Calculator) for precise calculations.
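Sample size sketch: You can also estimate the required sample size in Python with statsmodels. A minimal sketch reproducing the example above (0.4-point expected improvement against a standard deviation of 1.0):
from statsmodels.stats.power import TTestIndPower

effect_size = 0.4 / 1.0  # Cohen's d = expected difference / standard deviation

n_per_variant = TTestIndPower().solve_power(
    effect_size=effect_size,
    power=0.8,        # 80% chance of detecting a real effect of this size
    alpha=0.05,       # two-sided significance threshold
    alternative="two-sided",
)
print(f"Required samples per variant: {n_per_variant:.0f}")  # roughly 100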
For continuous metrics (quality scores, latency):
  • t-test: Compares means, assumes normal distribution
  • Mann-Whitney U test: Compares medians, no distribution assumption (recommended for scores)
For binary metrics (thumbs up/down, success/failure):
  • Chi-square test: Compares proportions
  • Fisher’s exact test: For small sample sizes
For count data (errors, conversions):
  • Poisson test: Compares event rates
Python example (Mann-Whitney U test):
from scipy import stats

variant_a_scores = [4.2, 4.0, 4.5, 4.1, ...]  # 400 scores
variant_b_scores = [4.5, 4.3, 4.7, 4.4, ...]  # 400 scores

statistic, p_value = stats.mannwhitneyu(
    variant_a_scores,
    variant_b_scores,
    alternative='two-sided'
)

print(f"p-value: {p_value}")
if p_value < 0.05:
    print("Statistically significant difference")
else:
    print("No significant difference")

Common Pitfalls to Avoid

Problem: Declaring a winner after 50 samples because variant B looks better.
Why it’s wrong: Small samples have high variance. Early results often don’t hold with more data.
Solution: Pre-commit to minimum sample size (100-200+ per variant) before looking at results. Use sequential testing methods if you must peek early.
Problem: Running multiple tests on the same data until you find statistical significance.
Example: Testing 20 different metrics, finding that 1 is significant at p < 0.05 (expected by chance).
Solution: Pre-register your primary metric before starting the test. Treat secondary metrics as exploratory only.
Problem: Deploying a variant because it’s statistically better, even though the improvement is tiny.
Example: p < 0.01 but quality improves only 0.5% while cost increases 30%.
Solution: Set minimum thresholds for practical significance before the test. Consider cost-benefit tradeoffs.
Problem: Implementing the A/B test but forgetting to link prompts to generation spans.
Result: ABV can’t aggregate metrics by prompt version, so you have no way to compare variants.
Solution: Always pass prompt=selected_prompt when creating generation spans:
with abv.start_as_current_observation(
    as_type="generation",
    prompt=selected_prompt  # Don't forget this!
) as generation:
    ...
Problem: Running variant A during weekdays and variant B during weekends, then concluding B is better.
Why it’s wrong: Weekend traffic might differ from weekday traffic. You can’t tell if the difference is due to the prompt or the day of week.
Solution: Run variants concurrently with randomized assignment to ensure comparable populations.

Next Steps