A/B testing (also called split testing) enables comparing two or more prompt versions in production with real users and use cases. Rather than choosing between prompts based on intuition or small-scale testing, A/B testing provides statistical evidence about which prompt performs better under real-world conditions.
Safety-critical systems (autonomous vehicles, industrial controls)
Low-volume applications:
Fewer than 100 daily users (insufficient statistical power)
Use cases with long feedback cycles (weeks between samples)
Scenarios where each request is unique (no aggregate patterns)
High-stakes accuracy requirements:
Applications where any error is unacceptable
Regulated industries with strict compliance requirements
Use cases requiring deterministic outputs
Alternative: For these scenarios, use comprehensive offline evaluation on datasets before deploying to production, then monitor with 100% production traffic rather than split testing.
Prerequisites for Successful A/B Testing
Before starting A/B testing, ensure you have:
Measurable success metrics: Quality scores, user feedback, task completion rates, or business outcomes
Sufficient traffic volume: At least 100-200 samples per variant as a rough minimum; detecting small differences requires more
Prompt linking infrastructure: Ability to link prompts to traces for metric aggregation
Monitoring dashboards: Real-time visibility into quality metrics by prompt version
Rollback capability: Ability to stop the test and revert if issues arise
Statistical analysis skills: Understanding of significance testing, confidence intervals, and statistical power
Without these prerequisites, A/B testing becomes guesswork rather than scientific experimentation.
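The sample-size prerequisite depends on how large a difference you expect between variants. As a rough sketch, assuming a two-sided comparison of means under a normal approximation, the samples needed per variant can be estimated from a standardized effect size (Cohen's d). The function name and defaults below are illustrative and not part of ABV:

```python
from math import ceil
from statistics import NormalDist


def samples_per_variant(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate samples needed per variant for a two-sided, two-sample test.

    effect_size is the standardized difference (Cohen's d): the expected
    difference in means divided by the standard deviation of the metric.
    Uses the normal approximation n = 2 * ((z_alpha/2 + z_power) / d)^2.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Smaller expected effects need many more samples.
for d in (0.5, 0.3, 0.2):
    print(f"d={d}: {samples_per_variant(d)} samples per variant")
```

Under these assumptions, 100-200 samples per variant is enough only for moderate effects (d of roughly 0.3 or larger); a small effect (d = 0.2) needs close to 400 per variant.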
Complete workflow from setup to decision:
The A/B testing lifecycle: Create variants → Implement random assignment → Collect sufficient data → Analyze for statistical significance → Make deployment decision → Monitor results.
Create prompt variants and assign labels
Create two (or more) prompt versions with different content, structure, or parameters:
Via ABV UI:
Navigate to your prompt in the ABV dashboard
Create a new version with variant A content
Assign label variant-a (or prod-a)
Create another version with variant B content
Assign label variant-b (or prod-b)
Via SDK:
# Create variant A
abv.create_prompt(
    name="movie-critic",
    prompt="As a {{criticlevel}} movie critic, provide a detailed review of {{movie}}.",
    labels=["variant-a"],
    config={"temperature": 0.7}
)

# Create variant B
abv.create_prompt(
    name="movie-critic",
    prompt="You're a {{criticlevel}} film critic. Share your thoughts on {{movie}}.",
    labels=["variant-b"],
    config={"temperature": 0.8}
)
Version numbers: ABV automatically assigns incremental version numbers (e.g., versions 3 and 4), but you'll reference by label in your code.
Implement randomized assignment in application code
Modify your application to randomly select between variants for each request:
Python implementation:
from abvdev import ABV
from openai import OpenAI
import random

abv = ABV(api_key="sk-abv-...", host="https://app.abv.dev")
openai_client = OpenAI(api_key="sk-proj-...")

# Fetch both variants
prompt_a = abv.get_prompt("movie-critic", label="variant-a")
prompt_b = abv.get_prompt("movie-critic", label="variant-b")

# Randomly select variant (50/50 split)
selected_prompt = random.choice([prompt_a, prompt_b])

# Compile and use
compiled_prompt = selected_prompt.compile(
    criticlevel="expert",
    movie="Dune 2"
)

# Link prompt to trace for metric tracking
with abv.start_as_current_observation(
    as_type="generation",
    name="movie-review",
    prompt=selected_prompt  # Crucial: link for metrics
) as generation:
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": compiled_prompt}]
    )
    generation.update(output=response.choices[0].message.content)

abv.flush()  # For short-lived applications
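The implementation above picks a variant independently on every request. If you would rather have each user see the same variant for the whole test, one common alternative is to derive the assignment from a hash of the user ID. The sketch below is an assumption-laden illustration: `assign_variant`, the `experiment` salt, and the `split_a` parameter are hypothetical names, not ABV features.

```python
import hashlib


def assign_variant(user_id: str, experiment: str = "movie-critic", split_a: float = 0.5) -> str:
    """Deterministically map a user to a variant label.

    Hashing the experiment name together with the user ID keeps the
    assignment stable across requests and independent between experiments.
    split_a is the fraction of users routed to variant A (hypothetical knob).
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return "variant-a" if bucket < split_a * 10_000 else "variant-b"


# The same user always gets the same label.
label = assign_variant("user-123")
print(label, assign_variant("user-123") == label)
```

The returned label can then be passed to `abv.get_prompt("movie-critic", label=...)` as in the example above.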
User feedback: Thumbs up/down ratios, satisfaction ratings
Statistical significance:
Use significance tests (t-test, Mann-Whitney U test) to check whether observed differences are unlikely to be due to chance
Calculate confidence intervals (95% CI recommended)
Consider practical significance: Is the improvement meaningful even if statistically significant?
Example analysis:
Variant A:
- Median quality score: 4.2/5
- Median latency: 450ms
- Median cost: $0.003
- Samples: 1,250

Variant B:
- Median quality score: 4.5/5 (7% improvement)
- Median latency: 480ms (7% slower)
- Median cost: $0.004 (33% more expensive)
- Samples: 1,238

Statistical significance: p < 0.05 (quality improvement is significant)
Decision: Variant B improves quality but at higher cost. Evaluate tradeoff.
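The percentages in an analysis like this are relative changes against variant A. A small helper (the name `relative_change` is illustrative) makes the arithmetic explicit, using the median values from the example:

```python
def relative_change(baseline: float, candidate: float) -> float:
    """Percentage change of candidate relative to baseline."""
    return (candidate - baseline) / baseline * 100


# (variant A, variant B) medians taken from the example analysis
metrics = {
    "quality": (4.2, 4.5),
    "latency_ms": (450, 480),
    "cost_usd": (0.003, 0.004),
}
for name, (a, b) in metrics.items():
    print(f"{name}: {relative_change(a, b):+.1f}%")
```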
Tools for analysis: Use Python (scipy, statsmodels), R, or online calculators for significance testing.
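In practice you would likely reach for `scipy.stats.ttest_ind` (with `equal_var=False` for Welch's t-test) or `scipy.stats.mannwhitneyu`. To show what such a test computes, here is a standard-library sketch of a Welch-style comparison of means with a 95% confidence interval. It assumes samples are large enough for a normal approximation, and the scores are synthetic data generated only for illustration:

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def compare_means(a, b, confidence=0.95):
    """Return (difference, confidence interval, two-sided p-value) for mean(b) - mean(a).

    Uses unequal-variance standard errors and a normal approximation,
    which is reasonable for large samples only.
    """
    diff = mean(b) - mean(a)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, (diff - z_crit * se, diff + z_crit * se), p_value


# Synthetic quality scores (assumed means and spread, not real results)
rng = random.Random(42)
scores_a = [rng.gauss(4.2, 0.8) for _ in range(1250)]
scores_b = [rng.gauss(4.5, 0.8) for _ in range(1238)]

diff, (low, high), p = compare_means(scores_a, scores_b)
print(f"difference={diff:.3f}, 95% CI=({low:.3f}, {high:.3f}), p={p:.4g}")
```

A confidence interval that excludes zero corresponds to p < 0.05 here; whether the difference is large enough to matter is a separate, practical question.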
Make decision and deploy winner
Based on analysis, choose the winning variant:
Clear winner:
Variant significantly better on primary metric (quality)
No significant degradation on secondary metrics (cost, latency)
Action: Promote winner to production by reassigning production label
Mixed results:
Variant better on quality but worse on cost
Small improvement with high uncertainty
Action: Evaluate tradeoffs, possibly run longer test, or choose based on business priorities
No significant difference:
Variants perform similarly across all metrics
Action: Keep existing version (simpler) or choose based on maintenance/cost
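The three outcomes above can be written down as a decision rule before the test starts, which helps avoid rationalizing after the fact. This is a simplified sketch: the function name, the `min_lift` threshold, and the returned labels are all hypothetical and should be adapted to your own metrics.

```python
from __future__ import annotations


def recommend_action(
    p_value: float,
    quality_lift: float,
    secondary_regressions: list[str],
    alpha: float = 0.05,
    min_lift: float = 0.02,
) -> str:
    """Map test results to one of the three outcomes described above.

    quality_lift is the relative improvement on the primary metric
    (0.07 means 7%). secondary_regressions lists secondary metrics
    (e.g. "cost", "latency") that got significantly worse.
    """
    significant = p_value < alpha and quality_lift > 0
    if not significant:
        return "keep-existing"        # no significant difference
    if quality_lift >= min_lift and not secondary_regressions:
        return "promote"              # clear winner
    return "evaluate-tradeoffs"       # mixed results


# Variant B from the example analysis: better quality, higher cost
print(recommend_action(p_value=0.03, quality_lift=0.07, secondary_regressions=["cost"]))
```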
Deployment:
# After deciding variant-b is the winner, promote via UI or SDK:
abv.update_prompt(
    name="movie-critic",
    version=4,  # variant-b version number
    new_labels=["production"]  # Assign production label
)
Post-deployment monitoring: Continue monitoring quality after full rollout to ensure results hold at 100% traffic.
Problem: Declaring a winner after 50 samples because variant B looks better.
Why it's wrong: Small samples have high variance. Early results often don't hold with more data.
Solution: Pre-commit to a minimum sample size (100-200+ per variant) before looking at results. Use sequential testing methods if you must peek early.
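To see why peeking is risky, the simulation below runs A/A tests in which both variants draw from the same distribution, so every "significant" result is a false positive. It compares checking for significance every 50 samples against checking once at the end. This is a sketch under simple assumptions (normally distributed scores, a z-test, simulation sizes chosen for speed); the function name and parameters are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rates(n_sims=400, n_per_arm=500, look_every=50, alpha=0.05, seed=0):
    """Return (rate when stopping at any significant look, rate with one final look).

    n_per_arm should be a multiple of look_every so the last look is the final one.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = fixed_hits = 0
    for _ in range(n_sims):
        sum_a = sum_b = sq_a = sq_b = 0.0
        hit_at_any_look = hit_at_end = False
        for n in range(1, n_per_arm + 1):
            a, b = rng.gauss(0, 1), rng.gauss(0, 1)  # identical distributions
            sum_a += a
            sq_a += a * a
            sum_b += b
            sq_b += b * b
            if n % look_every == 0:
                var_a = (sq_a - sum_a ** 2 / n) / (n - 1)
                var_b = (sq_b - sum_b ** 2 / n) / (n - 1)
                z = (sum_b - sum_a) / n / sqrt((var_a + var_b) / n)
                significant = abs(z) > z_crit
                hit_at_any_look = hit_at_any_look or significant
                if n == n_per_arm:
                    hit_at_end = significant
        peeking_hits += hit_at_any_look
        fixed_hits += hit_at_end
    return peeking_hits / n_sims, fixed_hits / n_sims


peeking_rate, fixed_rate = false_positive_rates()
print(f"false positives with peeking: {peeking_rate:.1%}, single final look: {fixed_rate:.1%}")
```

The single final look should land near the nominal 5%, while stopping at the first significant peek produces false positives noticeably more often.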
P-Hacking (Data Dredging)
Problem: Running multiple tests on the same data until you find statistical significance.
Example: Testing 20 different metrics, finding that 1 is significant at p < 0.05 (expected by chance).
Solution: Pre-register your primary metric before starting the test. Treat secondary metrics as exploratory only.
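The "expected by chance" claim can be quantified. Assuming the 20 metrics are independent and none has a real effect, the chance that at least one crosses p < 0.05 is well over half. If you do need to test several metrics formally, a Bonferroni correction (a standard technique, not something this guide prescribes) tightens the per-test threshold. Function names below are illustrative:

```python
def family_wise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests with no real effect."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold that keeps the overall rate at or below alpha."""
    return alpha / num_tests


print(f"P(at least one false positive in 20 tests): {family_wise_error_rate(20):.2f}")
print(f"Bonferroni threshold for 20 tests: {bonferroni_threshold(20):.4f}")
```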
Ignoring Practical Significance
Problem: Deploying a variant because it's statistically better, even though the improvement is tiny.
Example: p < 0.01 but quality improves only 0.5% while cost increases 30%.
Solution: Set minimum thresholds for practical significance before the test. Consider cost-benefit tradeoffs.
Not Linking Prompts to Traces
Problem: Implementing an A/B test but forgetting to link prompts to generation spans.
Result: ABV can't aggregate metrics by prompt version. You have no way to compare variants.
Solution: Always pass prompt=selected_prompt when creating generation spans:
with abv.start_as_current_observation(
    as_type="generation",
    prompt=selected_prompt  # Don't forget this!
) as generation:
    ...
Confounding Variables
Problem: Running variant A during weekdays and variant B during weekends, then concluding B is better.
Why it's wrong: Weekend traffic might differ from weekday traffic. You can't tell if the difference is due to the prompt or the day of week.
Solution: Run variants concurrently with randomized assignment to ensure comparable populations.