# Prompt Experiments
You can execute prompt experiments in the abv UI to test different prompt versions from [Prompt Management](https://docs.abv.dev/prompt-management-overview) or different language models, and compare the results side by side. Optionally, you can use [LLM-as-a-Judge evaluators](https://docs.abv.dev/llm-as-a-judge) to automatically score the responses against the expected outputs and further analyze the results on an aggregate level.

## Why use prompt experiments?

- Quickly test different prompt versions or models.
- Structure your prompt testing by using a dataset to test different prompt versions and models.
- Quickly iterate on prompts through dataset runs.
- Optionally use LLM-as-a-Judge evaluators to score the responses against the expected outputs from the dataset.
- Prevent regressions by running tests when making prompt changes.

## Prerequisites

### 1) Create a usable prompt

Create the prompt that you want to test and evaluate. How to create a prompt? See https://docs.abv.dev/get-started-with-prompt-management.

A prompt is usable when its variables match the dataset item keys of the dataset that will be used for the dataset run. See the example below.

**Example: prompt variables & dataset item keys mapping**

Prompt:

```
You are an abv expert. Answer based on: {{documentation}}

Question: {{question}}
```

Dataset item:

```json
{
  "documentation": "abv is an LLM engineering platform",
  "question": "What is abv?"
}
```

In this example:

- The prompt variable `{{documentation}}` maps to the JSON key `"documentation"`.
- The prompt variable `{{question}}` maps to the JSON key `"question"`.

Both keys must exist in the dataset item's input JSON for the experiment to run successfully.

**Example: chat message placeholder mapping**

In addition to variables, you can also map placeholders in chat message prompts to dataset item keys. This is useful when the dataset item also contains, for example, a chat message history to use. Your chat prompt needs to contain a placeholder with a name. Variables within placeholders are not resolved.

Chat prompt: placeholder named `message_history`

Dataset item:

```json
{
  "message_history": [
    {
      "role": "user",
      "content": "What is abv?"
    },
    {
      "role": "assistant",
      "content": "abv is a tool for tracking and analyzing the performance of language models."
    }
  ],
  "question": "What is abv?"
}
```

In this example:

- The chat prompt placeholder `message_history` maps to the JSON key `"message_history"`.
- The prompt variable `{{question}}` maps to the JSON key `"question"` via a regular variable, not one inside a placeholder message.

Both keys must exist in the dataset item's input JSON for the experiment to run successfully.
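To make the mapping rules concrete, here is a minimal Python sketch of how prompt variables and placeholders could be checked against a dataset item's input and then resolved. The helper functions and the `{"type": "placeholder", ...}` message representation are illustrative assumptions, not part of the abv SDK.

```python
import re

# Dataset item input from the placeholder example above
item_input = {
    "message_history": [
        {"role": "user", "content": "What is abv?"},
        {
            "role": "assistant",
            "content": "abv is a tool for tracking and analyzing the performance of language models.",
        },
    ],
    "question": "What is abv?",
}

# Chat prompt: a placeholder for prior messages plus a templated user message
chat_prompt = [
    {"type": "placeholder", "name": "message_history"},
    {"role": "user", "content": "Answer concisely: {{question}}"},
]

VARIABLE_RE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def required_keys(prompt):
    """Collect the variable and placeholder names referenced by the prompt."""
    keys = set()
    for msg in prompt:
        if msg.get("type") == "placeholder":
            keys.add(msg["name"])
        else:
            keys.update(VARIABLE_RE.findall(msg.get("content", "")))
    return keys


def compile_messages(prompt, item_input):
    """Splice placeholder messages from the dataset item and substitute {{variables}}.

    Messages spliced in via a placeholder are kept as-is, mirroring the rule that
    variables within placeholders are not resolved.
    """
    messages = []
    for msg in prompt:
        if msg.get("type") == "placeholder":
            messages.extend(item_input[msg["name"]])
        else:
            content = VARIABLE_RE.sub(lambda m: str(item_input[m.group(1)]), msg["content"])
            messages.append({"role": msg["role"], "content": content})
    return messages


missing = required_keys(chat_prompt) - item_input.keys()
assert not missing, f"Dataset item input is missing keys: {missing}"

print(compile_messages(chat_prompt, item_input))
```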
### 2) Create a usable dataset

Create a dataset with the inputs and expected outputs you want to use for your prompt experiments. How to create a dataset? See https://docs.abv.dev/datasets.

A dataset is usable when [1] the dataset items have JSON objects as input and [2] these objects have JSON keys that match the prompt variables of the prompt(s) you will use. See the example below.

**Example: prompt variables & dataset item keys mapping**

Prompt:

```
You are an abv expert. Answer based on: {{documentation}}

Question: {{question}}
```

Dataset item:

```json
{
  "documentation": "abv is an LLM engineering platform",
  "question": "What is abv?"
}
```

In this example:

- The prompt variable `{{documentation}}` maps to the JSON key `"documentation"`.
- The prompt variable `{{question}}` maps to the JSON key `"question"`.

Both keys must exist in the dataset item's input JSON for the experiment to run successfully.

### 3) Configure an LLM connection

As your prompt will be executed for each dataset item, you need to configure an LLM connection in the project settings. How to configure an LLM connection? See https://docs.abv.dev/llm-connections.

### 4) Optional: set up LLM-as-a-Judge

You can set up an LLM-as-a-Judge evaluator to score the responses against the expected outputs. Make sure to set the target of the LLM-as-a-Judge to "experiment runs" and filter for the dataset you want to use. How to set up LLM-as-a-Judge? See https://docs.abv.dev/llm-as-a-judge.

## Trigger a prompt experiment

### 1) Navigate to the dataset

Dataset runs are currently started from the detail page of a dataset.

- Navigate to your project > Datasets.
- Click on the dataset you want to start a dataset run for.

### 2) Open the setup page

- Click on "Start Experiment" to open the setup page.
- Click on "Create" below "Prompt Experiment".

### 3) Configure the dataset run

- Set a dataset run name.
- Select the prompt you want to use.
- Set up or select the LLM connection you want to use.
- Select the dataset you want to use.
- Optionally, select the evaluator you want to use.
- Click on "Create" to trigger the dataset run.

This triggers the dataset run and redirects you to the dataset runs page. The run might take a few seconds or minutes to complete, depending on the prompt complexity and dataset size.

### 4) Compare runs

After each experiment run, you can check the aggregated score in the dataset runs table and compare results side by side. Conceptually, what happens for each dataset item during a run is sketched below.
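The following sketch illustrates the per-item loop of a dataset run: the item's input fills the prompt variables, the compiled prompt is sent to the configured model, and the output is kept next to the item's expected output for later scoring. This is only an illustration under stated assumptions, namely a plain-text prompt and an OpenAI-compatible model called via the `openai` Python SDK with a placeholder model name; it is not abv's actual implementation, which runs in the platform using the LLM connection from step 3.

```python
import re

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt_template = (
    "You are an abv expert. Answer based on: {{documentation}}\n\nQuestion: {{question}}"
)

# A tiny stand-in for a dataset: each item has an input and an expected output
dataset = [
    {
        "input": {
            "documentation": "abv is an LLM engineering platform",
            "question": "What is abv?",
        },
        "expected_output": "abv is an LLM engineering platform.",
    },
]


def compile_prompt(template: str, variables: dict) -> str:
    """Substitute {{variable}} markers with values from the dataset item input."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(variables[m.group(1)]), template)


for item in dataset:
    compiled = compile_prompt(prompt_template, item["input"])
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use the model behind your LLM connection
        messages=[{"role": "user", "content": compiled}],
    )
    output = response.choices[0].message.content
    # abv stores the output alongside the expected output on the dataset run item,
    # where an LLM-as-a-Judge evaluator can score it afterwards.
    print({"output": output, "expected_output": item["expected_output"]})
```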