# A/B Testing of LLM Prompts
abv prompt management (see Get Started with Prompt Management docid\ cfty8fgfemho15jfh2983) enables A/B testing by allowing you to label different versions of a prompt (e.g., `prod-a` and `prod-b`). Your application can randomly alternate between these versions, while abv tracks performance metrics like response latency, cost, token usage, and evaluation metrics for each version.

## When to use A/B testing?

A/B testing helps you see how different prompt versions perform in real situations, adding to what you learn from testing on datasets. It works best when:

- Your app has good ways to measure success, deals with many different kinds of user inputs, and can tolerate some variance in performance. This usually applies to consumer apps where mistakes aren't a big deal.
- You've already tested thoroughly on your test data and want to try your changes with a small group of users before rolling out to everyone (also called canary deployment). A weighted-split sketch at the end of this page shows how the random selection in step 2 can be skewed for this.

## Implementation

### 1) Label your prompt versions

Label your prompt versions (e.g., `prod-a` and `prod-b`) to identify the different variants for testing. A sketch of one way this could be done via the SDK appears at the end of this page.

### 2) Fetch prompts and run the A/B test

**Python SDK**

```python
import random

from abvdev import get_client
from openai import OpenAI

# Requires environment variables for initialization
abv = get_client()

openai_client = OpenAI()

# Fetch prompt versions
prompt_a = abv.get_prompt("my-prompt-name", label="prod-a")
prompt_b = abv.get_prompt("my-prompt-name", label="prod-b")

# Randomly select version
selected_prompt = random.choice([prompt_a, prompt_b])

# Use in LLM call
with abv.start_as_current_observation(
    as_type="generation", name="openai-gen", prompt=selected_prompt
):
    response = openai_client.chat.completions.create(
        model="gpt-5-2025-08-07",
        messages=[
            {"role": "user", "content": selected_prompt.compile(variable="value")}
        ],
    )
    result_text = response.choices[0].message.content

# Flush events to abv in short-lived applications
abv.flush()
```

**JS/TS SDK**

```typescript
import { AbvClient } from "@abvdev/client";
import { startObservation } from "@abvdev/tracing";
import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

// Requires environment variables for initialization
const abv = new AbvClient();

// Fetch prompt versions
const promptA = await abv.prompt.get("my-prompt-name", {
  label: "prod-a",
});
const promptB = await abv.prompt.get("my-prompt-name", {
  label: "prod-b",
});

// Randomly select version
const selectedPrompt = Math.random() < 0.5 ? promptA : promptB;

const model = "gpt-4o";

// Use in LLM call: create a generation observation linked to the selected prompt version
const generation = startObservation(
  "llm-call-a-b",
  {
    model: model,
    input: [{ role: "user", content: "What is the capital of France?" }],
    prompt: selectedPrompt,
  },
  { asType: "generation" }
);

const completion = await openai.chat.completions.create({
  model: model,
  messages: [
    {
      role: "user",
      content: selectedPrompt.compile({ variable: "value" }),
    },
  ],
});

const resultText = completion.choices[0].message.content;

generation.update({
  output: { content: resultText },
});
generation.end();
```

Refer to Get Started with Prompt Management docid\ cfty8fgfemho15jfh2983 for additional examples of how to fetch and use prompts.

### 3) Analyze results

Compare metrics for each prompt version in the abv UI. Key metrics available for comparison:

- Response latency and token usage
- Cost per request
- Quality evaluation scores
- Custom metrics you define
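As a supplement to step 1, the sketch below shows one way a label could be attached when a prompt version is pushed via the Python SDK. The `create_prompt` method, its parameters, and the prompt text are illustrative assumptions, not confirmed by this guide; adapt them to however your abv setup assigns labels.

```python
from abvdev import get_client

abv = get_client()

# Hypothetical call: push a new prompt version and label it "prod-a".
# The method name and parameters are assumptions; check your SDK reference.
abv.create_prompt(
    name="my-prompt-name",
    prompt="Answer the user's question about {{variable}} concisely.",
    labels=["prod-a"],
)
```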
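For the canary-style rollout mentioned under "When to use A/B testing?", the 50/50 `random.choice` split in step 2 can be replaced by a weighted split so that only a small share of traffic hits the candidate version. This is a minimal, SDK-independent sketch; the `select_prompt` helper and the 10% default are illustrative.

```python
import random

def select_prompt(prompt_a, prompt_b, canary_fraction: float = 0.1):
    """Return the candidate prompt (prompt_b) for a small fraction of requests,
    otherwise the established prompt (prompt_a)."""
    return prompt_b if random.random() < canary_fraction else prompt_a

# Example: roughly 10% of requests use the prod-b prompt, the rest use prod-a.
# selected_prompt = select_prompt(prompt_a, prompt_b)
```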