# Remote Dataset Runs
Once you have [created a dataset](https://docs.abv.dev/datasets), you can use it to test how your application performs on different inputs. Remote dataset runs are used to programmatically loop your application or prompts through a dataset and optionally apply evaluation methods to the results. They are called "remote dataset runs" because they can make use of "remote" or external logic and code. Optionally, you can also trigger remote dataset runs via the abv UI, which will call them via a webhook.

## Why use remote dataset runs?

- Full flexibility to use your own application logic
- Use custom scoring functions to evaluate the outputs
- Run multiple experiments on the same dataset in parallel
- Easy to integrate with your existing evaluation infrastructure

## Setup & run via SDK

```mermaid
sequenceDiagram
    actor Person as User
    participant User as Experiment Runner
    participant App as LLM Application
    participant LF as abv Server

    Note over Person, LF: Setup & execute remote dataset run

    %% Trigger dataset run
    Person->>User: Trigger dataset run

    %% Load dataset
    User->>LF: get_dataset("dataset name")
    LF->>User: Return dataset with items

    %% Loop through dataset items
    loop For each dataset item
        Note over User, LF: Process dataset item

        %% Start dataset run context
        User->>LF: item.run(run_name, description, metadata)
        LF->>User: Return run context (root span)

        %% Execute LLM application
        User->>App: Execute LLM app with item input
        Note over App: Create abv trace<br/>for execution
        App->>LF: App is instrumented with abv and reports traces
        App->>User: Return application output

        %% Optional: add evaluation scores from experiment runner
        opt Add custom scores
            Note over User: Run evaluation function locally
            User->>LF: root_span.score_trace(name, value, comment)
        end

        %% Link execution to dataset item
        Note over User, LF: Trace automatically linked<br/>to dataset run
    end

    %% Flush data to server
    User->>LF: flush() to send all pending data
    LF->>User: Confirm data received
    User->>Person: Dataset run complete

    %% Optional: abv server-side evaluations
    opt Server-side evaluations
        Note over LF: Run configured evaluations<br/>(e.g., LLM-as-a-judge)
        LF->>LF: Add evaluation scores to dataset run
    end

    %% View results in UI
    Note over Person, LF: Analyze dataset run results
    Person->>LF: Access dataset runs UI
    LF->>Person: Display aggregated scores and comparisons
```

### 1) Instrument your application

First, we create our application runner helper function. This function will be called for every dataset item in the next step. If you use abv for production observability, you do not need to change your application code.

For a dataset run, it is important that your application creates abv traces for each execution so they can be linked to the dataset item. Please refer to the integrations page for details on how to instrument the framework you are using.

**Python SDK**

Assume you already have an abv-instrumented LLM app:

```python
# app.py
from abvdev import get_client, observe
from openai import OpenAI

abv = get_client()

@observe
def my_llm_function(question: str):
    with abv.start_as_current_observation(as_type="generation", name="openai-gen"):
        response = OpenAI().chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": question}],
        )
        output = response.choices[0].message.content

        # Update trace input / output
        get_client().update_current_trace(input=question, output=output)

    return output
```

See the [Python SDK](https://docs.abv.dev/python-sdk) docs for more details.

**JS/TS SDK**

Please make sure you have the TypeScript SDK set up for tracing of your application. If you use abv for observability & tracing, this is the same setup.

```typescript
// app.ts
import { AbvClient } from "@abvdev/client";
import { startActiveObservation } from "@abvdev/tracing";
import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

const myLlmApplication = async (input: string) => {
  return startActiveObservation("my-llm-application", async (span) => {
    const output = await openai.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: input }],
    });

    span.update({ input, output: output.choices[0].message.content });

    // Return reference to span and output
    // Will be simplified in a future version of the SDK
    return [span, output] as const;
  });
};
```
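Before wiring the app into a dataset run, it can help to verify the instrumentation on its own. A minimal sketch using the Python app from above; the test question is only an illustrative example:

```python
# sanity_check.py - optional: call the instrumented app once and confirm a trace appears in abv
from abvdev import get_client

from app import my_llm_function

# Any input works here; this question is just an example
print(my_llm_function("What is the capital of France?"))

# Ensure the trace is sent before the script exits
get_client().flush()
```

If a trace for this call shows up in abv, the dataset run in the next step will link traces in exactly the same way.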
### 2) Run experiment on dataset

When running an experiment on a dataset, the application to be tested is executed for each item in the dataset. The execution trace is then linked to the dataset item. This allows you to compare different runs of the same application on the same dataset. Each experiment is identified by a run name.

**Python SDK**

You may then execute that LLM app for each dataset item to create a dataset run:

```python
# execute_dataset.py
from abvdev import get_client

from app import my_llm_function

# Load the dataset
dataset = get_client().get_dataset("<dataset_name>")

# Loop over the dataset items
for item in dataset.items:
    # Use the item.run() context manager for automatic trace linking
    with item.run(
        run_name="<run_name>",
        run_description="My first run",
        run_metadata={"model": "llama3"},
    ) as root_span:
        # Execute your LLM app against the dataset item input
        output = my_llm_function(item.input)

        # Optionally add scores computed in your experiment runner, e.g. a JSON equality check
        root_span.score_trace(
            name="<example_eval>",
            value=my_eval_fn(item.input, output, item.expected_output),
            comment="This is a comment",  # optional, useful to add reasoning
        )

# Flush the abv client to ensure all data is sent to the server at the end of the experiment run
get_client().flush()
```

See the [Python SDK](https://docs.abv.dev/python-sdk) docs for details on the new OpenTelemetry-based SDK.

**JS/TS SDK**

```typescript
import { AbvClient } from "@abvdev/client";

const abv = new AbvClient();

// `dataset` is fetched via the abv client beforehand (see SDK reference)
for (const item of dataset.items) {
  // Execute the application function and get the abv object (trace/span/generation/event);
  // the output is also returned as it is used to evaluate the run.
  // You can also link using IDs, see the SDK reference for details.
  const [span, output] = await myLlmApplication(item.input);

  // Link the execution trace to the dataset item and give it a run name
  await item.link(span, "<run_name>", {
    description: "My first run", // optional run description
    metadata: { model: "llama3" }, // optional run metadata
  });

  // Optionally add scores
  abv.score.trace(span, {
    name: "<score_name>",
    value: myEvalFunction(item.input, output, item.expectedOutput),
    comment: "This is a comment", // optional, useful to add reasoning
  });
}

// Flush the abv client to ensure all score data is sent to the server at the end of the experiment run
await abv.flush();
```

If you want to learn more about how adding evaluation scores from your code works, please refer to the [custom scores](https://docs.abv.dev/custom-scores) docs.
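The `my_eval_fn` (Python) and `myEvalFunction` (JS/TS) used above are placeholders for your own scoring logic. As a minimal sketch, assuming the expected output is stored on the dataset item (optionally as JSON), such a function could perform a simple equality check:

```python
import json


def my_eval_fn(input, output, expected_output) -> float:
    """Toy scorer: 1.0 if the app output matches the expected output, else 0.0.

    `input` is unused in this example but is available for input-dependent scoring.
    Values are parsed as JSON where possible so formatting differences
    (whitespace, key order) do not affect the score.
    """

    def normalize(value):
        # Parse JSON strings so semantically equal payloads compare equal
        if isinstance(value, str):
            try:
                return json.loads(value)
            except json.JSONDecodeError:
                return value.strip()
        return value

    return 1.0 if normalize(output) == normalize(expected_output) else 0.0
```

Any scoring logic works here, as long as the returned value matches what you ingest via `score_trace`; see the custom scores docs linked above for the supported score types.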
### 3) Optionally: run evals in abv

In the code above, we show how to add scores to the dataset run from your experiment code. Alternatively, you can run evals in abv. This is useful if you want to use the [LLM-as-a-judge](https://docs.abv.dev/llm-as-a-judge) feature to evaluate the outputs of the dataset runs.

We have recorded a 10 min walkthrough on how this works end to end: [set up LLM-as-a-judge](https://docs.abv.dev/llm-as-a-judge).

### 4) Compare dataset runs

After each experiment run on a dataset, you can check the aggregated scores in the dataset runs table and compare results side by side.

## Optional: trigger remote dataset runs via UI

When setting up remote dataset runs via the SDK, it can be useful to expose a trigger in the abv UI that lets you start the experiment runs. You need to set up a webhook to receive the trigger request from abv.

### 1) Navigate to the dataset

Navigate to your project > Datasets. Click on the dataset you want to set up a remote experiment trigger for.

### 2) Open the setup page

Click on "Start Experiment" to open the setup page. Click on ⚡ below "Custom Experiment".

### 3) Configure the webhook

Enter the URL of your external evaluation service that will receive the webhook when experiments are triggered. Specify a default config that will be sent to your webhook; users can modify it when triggering experiments.

### 4) Trigger experiments

Once configured, team members can trigger remote experiments via the "Run" button under the "Custom Experiment" option. abv will send the dataset metadata (ID and name) along with any custom configuration to your webhook.

Typical workflow: your webhook receives the request, fetches the dataset from abv, runs your application against the dataset items, evaluates the results, and ingests the scores back into abv as a new dataset run (see the sketch below).
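As a rough sketch of such a webhook receiver, a minimal FastAPI endpoint could look like the following. The payload field names (`datasetName`, `config`, `runName`) and the module paths (`app`, `evals`) are illustrative assumptions, not the exact schema abv sends; inspect the request you actually receive and adapt accordingly.

```python
# webhook_server.py - illustrative sketch, not the exact abv webhook payload schema
from fastapi import FastAPI, Request

from abvdev import get_client

from app import my_llm_function   # instrumented app from step 1
from evals import my_eval_fn      # scoring helper sketched above (hypothetical module path)

api = FastAPI()


@api.post("/run-experiment")
async def run_experiment(request: Request):
    payload = await request.json()

    # Field names below are assumptions; check the actual webhook payload
    dataset_name = payload.get("datasetName")
    config = payload.get("config", {})
    run_name = config.get("runName", "ui-triggered-run")

    abv = get_client()
    dataset = abv.get_dataset(dataset_name)

    # Same linking pattern as in step 2
    for item in dataset.items:
        with item.run(run_name=run_name, run_metadata=config) as root_span:
            output = my_llm_function(item.input)
            root_span.score_trace(
                name="exact_match",
                value=my_eval_fn(item.input, output, item.expected_output),
            )

    # Ensure all traces and scores are sent before responding
    abv.flush()
    return {"status": "ok", "run": run_name}
```

Run it with e.g. `uvicorn webhook_server:api` and point the webhook URL configured in step 3 at the `/run-experiment` route; the new dataset run will then appear in the dataset runs table like any SDK-triggered run.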