Once you have created a dataset, you can use it to test how your application performs on different inputs.
Remote Dataset Runs are used to programmatically loop your applications or prompts through a dataset and optionally apply Evaluation Methods to the results. They are called “Remote Dataset Runs” because they can make use of “remote” or external logic and code. Optionally, you can also trigger Remote Dataset Runs from the ABV UI, which calls them via a webhook.
First we create our application runner helper function. This function will be called for every dataset item in the next step. If you use ABV for production observability, you do not need to change your application code.
For a dataset run, it is important that your application creates ABV traces for each execution so they can be linked to the dataset item. Please refer to the integrations page for details on how to instrument the framework you are using.
Please make sure you have the JS/TS SDK set up for tracing your application. If you use ABV for observability, this is the same setup.
Install packages
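The exact package set depends on your setup; based on the imports used in the snippets below (@abvdev/client, @abvdev/otel, @abvdev/tracing, and openai), a likely install step is:
npm install @abvdev/client @abvdev/otel @abvdev/tracing openai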
Add credentials
Add your ABV credentials to your environment variables. Make sure that you have a .env file in your project root and a package like dotenv to load the variables.
.env
ABV_API_KEY="sk-abv-..."
ABV_BASE_URL="https://app.abv.dev" # US region
# ABV_BASE_URL="https://eu.app.abv.dev" # EU region
OPENAI_API_KEY="sk-proj-..." # added to use the OpenAI LLM
Initialize OpenTelemetry
Install the OpenTelemetry Node SDK package:
npm install @opentelemetry/sdk-node
Create an instrumentation.ts file that initializes the OpenTelemetry NodeSDK and registers the ABVSpanProcessor.
instrumentation.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { ABVSpanProcessor } from "@abvdev/otel";

const sdk = new NodeSDK({
  spanProcessors: [new ABVSpanProcessor()],
});

sdk.start();
Modify the instrumentation.ts file to use the dotenv package to load the variables. Additional parameters are provided so that traces become visible in the UI immediately.
npm install dotenv
instrumentation.ts
import dotenv from "dotenv";
dotenv.config();

import { NodeSDK } from "@opentelemetry/sdk-node";
import { ABVSpanProcessor } from "@abvdev/otel";

const sdk = new NodeSDK({
  spanProcessors: [
    new ABVSpanProcessor({
      apiKey: process.env.ABV_API_KEY,
      baseUrl: process.env.ABV_BASE_URL,
      exportMode: "immediate",
      flushAt: 1,
      flushInterval: 1,
      additionalHeaders: {
        "Content-Type": "application/json",
        Accept: "application/json",
      },
    }),
  ],
});

sdk.start();
Import the instrumentation.ts file at the top of your application.
index.ts
import "./instrumentation"; // Must be the first import
Instrument your application:
app.ts
import { startActiveObservation } from "@abvdev/tracing";
import OpenAI from "openai";
import dotenv from "dotenv";

dotenv.config();

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

// Exported so it can be reused by the dataset run script below
export const myLLMApplication = async (input: string) => {
  return startActiveObservation("my-llm-application", async (span) => {
    const output = await openai.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: input }],
    });
    span.update({ input, output: output.choices[0].message.content });

    // return a reference to the span and the output
    // will be simplified in a future version of the SDK
    return [span, output] as const;
  });
};
When running an experiment on a dataset, the application to be tested is executed for each item in the dataset. The execution trace is then linked to the dataset item. This allows you to compare different runs of the same application on the same dataset. Each experiment is identified by a run_name.
You may then execute that LLM app for each dataset item to create a dataset run:
execute_dataset.py
from abvdev import get_client

from .app import my_llm_application

# Load the dataset
dataset = get_client().get_dataset("<dataset_name>")

# Loop over the dataset items
for item in dataset.items:
    # Use the item.run() context manager for automatic trace linking
    with item.run(
        run_name="<run_name>",
        run_description="My first run",
        run_metadata={"model": "llama3"},
    ) as root_span:
        # Execute your LLM-app against the dataset item input
        output = my_llm_application.run(item.input)

        # Optionally: Add scores computed in your experiment runner, e.g. a JSON equality check
        root_span.score_trace(
            name="<example_eval>",
            value=my_eval_fn(item.input, output, item.expected_output),
            comment="This is a comment",  # optional, useful to add reasoning
        )

# Flush the ABV client to ensure all data is sent to the server at the end of the experiment run
get_client().flush()
See Python SDK docs for details on the new OpenTelemetry-based SDK.
import { ABVClient } from "@abvdev/client";
import { myLLMApplication } from "./app";

const abv = new ABVClient();

// dataset is assumed to have been fetched via the ABV client beforehand
for (const item of dataset.items) {
  // execute the application function and get the ABV object (trace/span/generation/event)
  // the output is also returned as it is used to evaluate the run
  // you can also link using ids, see the SDK reference for details
  const [span, output] = await myLLMApplication(item.input);

  // link the execution trace to the dataset item and give it a run_name
  await item.link(span, "<run_name>", {
    description: "My first run", // optional run description
    metadata: { model: "llama3" }, // optional run metadata
  });

  // Optionally: Add scores
  abv.score.trace(span, {
    name: "<score_name>",
    value: myEvalFunction(item.input, output, item.expectedOutput),
    comment: "This is a comment", // optional, useful to add reasoning
  });
}

// Flush the ABV client to ensure all score data is sent to the server at the end of the experiment run
await abv.flush();
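The evaluation function referenced above (myEvalFunction here, my_eval_fn in the Python snippet) is your own code. As a hypothetical sketch, a simple exact-match check against the expected output could look like this:
// Hypothetical example evaluator; replace it with whatever metric fits your
// use case (JSON equality, string similarity, an LLM-as-a-judge call, ...).
type ChatCompletionLike = {
  choices: { message: { content: string | null } }[];
};

export const myEvalFunction = (
  _input: string,
  output: ChatCompletionLike,
  expectedOutput: string
): number => {
  // Returns 1 if the first completion exactly matches the expected output, 0 otherwise
  const completion = output.choices[0]?.message.content ?? "";
  return completion.trim() === String(expectedOutput).trim() ? 1 : 0;
};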
If you want to learn more about how adding evaluation scores from the code works, please refer to the docs:
Add custom scores
In the code above, we show how to add scores to the dataset run from your experiment code. Alternatively, you can run evals in ABV. This is useful if you want to use the LLM-as-a-Judge feature to evaluate the outputs of the dataset runs.
Set up LLM-as-a-judge
When setting up Remote Dataset Runs via the SDK, it can be useful to expose a trigger in the ABV UI that lets you start experiment runs. You need to set up a webhook to receive the trigger request from ABV.
Enter the URL of your external evaluation service that will receive the webhook when experiments are triggered.
Specify a default config that will be sent to your webhook. Users can modify this when triggering experiments.
Once configured, team members can trigger remote experiments via the Run button under the Custom Experiment option. ABV will send the dataset metadata (ID and name) along with any custom configuration to your webhook.
Typical workflow: Your webhook receives the request, fetches the dataset from ABV, runs your application against the dataset items, evaluates the results, and ingests the scores back into ABV as a new Dataset Run.
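The shape of that webhook service is up to you. As a minimal sketch, assuming an Express server, and assuming payload field names and a helper function that are placeholders rather than part of the ABV API, a handler could be structured like this:
// Minimal sketch of a trigger webhook, assuming an Express server.
// The payload field names (datasetName, config) are assumptions; inspect the
// request ABV actually sends to your endpoint before relying on them.
import express from "express";

const app = express();
app.use(express.json());

// Hypothetical helper wrapping the dataset run loop shown above:
// fetch the dataset, execute the application per item, evaluate, ingest scores.
async function runDatasetExperiment(datasetName: string, config: unknown) {
  // ... your dataset run code from the previous section ...
}

app.post("/run-experiment", async (req, res) => {
  const { datasetName, config } = req.body;

  // Acknowledge the trigger right away so the ABV UI is not kept waiting
  res.status(202).json({ status: "accepted" });

  // Run the experiment and ingest the results back into ABV as a new Dataset Run
  await runDatasetExperiment(datasetName, config);
});

app.listen(3000);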