Once you have created a dataset, you can use it to test how your application performs on different inputs. Remote Dataset Runs programmatically loop your application or prompts through a dataset and optionally apply Evaluation Methods to the results. They are called “Remote Dataset Runs” because they can make use of “remote” or external logic and code. Optionally, you can also trigger Remote Dataset Runs via the ABV UI, which calls them via a webhook.

Why use Remote Dataset Runs?

  • Full flexibility to use your own application logic
  • Use custom scoring functions to evaluate the outputs
  • Run multiple experiments on the same dataset in parallel
  • Easy to integrate with your existing evaluation infrastructure

Setup & Run via SDK

1) Instrument your application

First, we create an application runner helper function. This function will be called for every dataset item in the next step. If you already use ABV for production observability, you do not need to change your application code.
For a dataset run, it is important that your application creates ABV traces for each execution so they can be linked to the dataset item. Please refer to the integrations page for details on how to instrument the framework you are using.

Python SDK

Install packages
pip install abvdev openai
Assume you already have an ABV-instrumented LLM-app:
app.py
from abvdev import ABV, observe
from openai import OpenAI

abv = ABV(
    api_key="sk-abv-...",  # your API key here
    host="https://app.abv.dev",  # host="https://eu.app.abv.dev" for EU region
)

openai_client = OpenAI(api_key="sk-proj-...")

@observe
def my_llm_function(question: str):
    with abv.start_as_current_observation(as_type="generation", name="OpenAI-gen"):
        response = openai_client.chat.completions.create(
            model="gpt-4o", messages=[{"role": "user", "content": question}]
        )

        output = response.choices[0].message.content

        # Update trace input / output
        abv.update_current_trace(input=question, output=output)

        return output
See Python SDK docs for more details.
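To sanity-check the instrumentation before wiring it into a dataset run, you can call the function once and flush the client so the trace is exported. This is a minimal sketch; the file name and example question are illustrative, and the flush call is the same one used in the dataset run script below.
test_app.py
from abvdev import get_client

from app import my_llm_function

if __name__ == "__main__":
    # Illustrative example question; any input works for a quick smoke test
    answer = my_llm_function("What is the capital of France?")
    print(answer)

    # Ensure all pending traces are exported before the process exits
    get_client().flush()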

JS/TS SDK

Please make sure you have the JS/TS SDK set up for tracing your application. If you use ABV for observability, this is the same setup.
Install packages
npm install @abvdev/tracing @abvdev/otel @abvdev/client openai
Add credentials
Add your ABV credentials to your environment variables. Make sure that you have a .env file in your project root and a package like dotenv to load the variables.
.env
ABV_API_KEY="sk-abv-..."
ABV_BASE_URL="https://app.abv.dev" # US region
# ABV_BASE_URL="https://eu.app.abv.dev" # EU region
OPENAI_API_KEY="sk-proj-..." # added to use openai llm
Initialize OpenTelemetry
Install the OpenTelemetry Node SDK package:
npm install @opentelemetry/sdk-node
Create an instrumentation.ts file that initializes the OpenTelemetry NodeSDK and registers the ABVSpanProcessor.
instrumentation.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { ABVSpanProcessor } from "@abvdev/otel";

const sdk = new NodeSDK({
  spanProcessors: [new ABVSpanProcessor()],
});

sdk.start();
Modify the instrumentation.ts file to use the dotenv package to load the variables. The additional parameters make traces visible in the UI immediately.
npm install dotenv
instrumentation.ts
import dotenv from "dotenv";
dotenv.config();

import { NodeSDK } from "@opentelemetry/sdk-node";
import { ABVSpanProcessor } from "@abvdev/otel";

const sdk = new NodeSDK({
  spanProcessors: [
    new ABVSpanProcessor({
      apiKey: process.env.ABV_API_KEY,
      baseUrl: process.env.ABV_BASE_URL,
      exportMode: "immediate",
      flushAt: 1,
      flushInterval: 1,
      additionalHeaders: {
        "Content-Type": "application/json",
        "Accept": "application/json"
      }
    })
  ],
});

sdk.start();
Import the instrumentation.ts file at the top of your application.
index.ts
import "./instrumentation"; // Must be the first import
Instrument your application:
app.ts
import { startActiveObservation } from "@abvdev/tracing";
import OpenAI from "openai";
import dotenv from "dotenv";
dotenv.config();

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

export const myLLMApplication = async (input: string) => {
  return startActiveObservation("my-llm-application", async (span) => {
    const response = await openai.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: input }],
    });

    const output = response.choices[0].message.content;
    span.update({ input, output });

    // return a reference to the span together with the output
    // will be simplified in a future version of the SDK
    return [span, output] as const;
  });
};

2) Run experiment on dataset

When running an experiment on a dataset, the application to be tested is executed for each item in the dataset. The execution trace is then linked to the dataset item. This allows you to compare different runs of the same application on the same dataset. Each experiment is identified by a run_name.

Python SDK

You can then execute that LLM-app for each dataset item to create a dataset run:
execute_dataset.py
from abvdev import get_client
from .app import my_llm_function

# Load the dataset
dataset = get_client().get_dataset("<dataset_name>")

# Loop over the dataset items
for item in dataset.items:
    # Use the item.run() context manager for automatic trace linking
    with item.run(
        run_name="<run_name>",
        run_description="My first run",
        run_metadata={"model": "llama3"},
    ) as root_span:
        # Execute your LLM-app against the dataset item input
        output = my_llm_function(item.input)

        # Optionally: Add scores computed in your experiment runner, e.g. a JSON equality check
        root_span.score_trace(
            name="<example_eval>",
            value=my_eval_fn(item.input, output, item.expected_output),
            comment="This is a comment",  # optional, useful to add reasoning
        )

# Flush the abv client to ensure all data is sent to the server at the end of the experiment run
get_client().flush()
See Python SDK docs for details on the new OpenTelemetry-based SDK.
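The my_eval_fn used above is just a placeholder for whatever scoring logic fits your use case. As a minimal sketch (the function name and signature simply mirror the placeholder in the snippet above), a JSON equality check could look like this:

import json

def my_eval_fn(input, output, expected_output) -> float:
    """Toy scorer: 1.0 if the output matches the expected output, else 0.0."""
    try:
        # Compare as JSON so whitespace and key order do not count as mismatches
        return 1.0 if json.loads(output) == json.loads(expected_output) else 0.0
    except (TypeError, ValueError):
        # Fall back to a plain string comparison for non-JSON outputs
        return 1.0 if str(output).strip() == str(expected_output).strip() else 0.0

This example returns a float so the resulting scores can be aggregated in the dataset runs table.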

JS/TS SDK

import { ABVClient } from "@abvdev/client";
import { myLLMApplication } from "./app";

const abv = new ABVClient();

// assumes `dataset` has been fetched via the SDK beforehand (see SDK reference for details)
for (const item of dataset.items) {
  // execute the application function and get the span reference
  // the output is also returned as it is used to evaluate the run
  // you can also link using ids, see the SDK reference for details
  const [span, output] = await myLLMApplication(item.input);

  // link the execution trace to the dataset item and give it a run_name
  await item.link(span, "<run_name>", {
    description: "My first run", // optional run description
    metadata: { model: "llama3" }, // optional run metadata
  });

  // Optionally: Add scores
  abv.score.trace(span, {
    name: "<score_name>",
    value: myEvalFunction(item.input, output, item.expectedOutput),
    comment: "This is a comment", // optional, useful to add reasoning
  });
}

// Flush the abv client to ensure all score data is sent to the server at the end of the experiment run
await abv.flush();
To learn more about how adding evaluation scores from code works, please refer to the docs: Add custom scores

3) Optionally: Run Evals in ABV

In the code above, we show how to add scores to the dataset run from your experiment code. Alternatively, you can run evals in ABV. This is useful if you want to use the LLM-as-a-Judge feature to evaluate the outputs of the dataset runs. Set up LLM-as-a-judge

4) Compare dataset runs

After each experiment run on a dataset, you can check the aggregated score in the dataset runs table and compare results side-by-side.

Optional: Trigger Remote Dataset Runs via UI

When setting up Remote Dataset Runs via the SDK, it can be useful to expose a trigger in the ABV UI so team members can start experiment runs from there. You need to set up a webhook to receive the trigger request from ABV.

1) Navigate to the dataset

  • Navigate to Your Project > Datasets
  • Click on the dataset you want to set up a remote experiment trigger for

2) Open the setup page

Click Start Experiment to open the setup page, then select the Custom Experiment option.

3) Configure the webhook

Enter the URL of your external evaluation service that will receive the webhook when experiments are triggered. Specify a default config that will be sent to your webhook. Users can modify this when triggering experiments.

4) Trigger experiments

Once configured, team members can trigger remote experiments via the Run button under the Custom Experiment option. ABV will send the dataset metadata (ID and name) along with any custom configuration to your webhook. Typical workflow: Your webhook receives the request, fetches the dataset from ABV, runs your application against the dataset items, evaluates the results, and ingests the scores back into ABV as a new Dataset Run.
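As an illustration of that workflow, here is a minimal sketch of such a webhook endpoint written with FastAPI. The endpoint path, the payload field names (datasetName, config), and the run_experiment helper are assumptions for illustration only; check the request ABV sends to your webhook for the exact schema.
webhook_server.py
from fastapi import FastAPI, Request

from abvdev import get_client

app = FastAPI()

@app.post("/abv/experiment-trigger")
async def handle_experiment_trigger(request: Request):
    payload = await request.json()

    # Assumed field names; inspect the actual webhook payload for the exact keys
    dataset_name = payload.get("datasetName")
    config = payload.get("config", {})

    # Fetch the dataset and run the experiment, e.g. by reusing the loop from step 2
    dataset = get_client().get_dataset(dataset_name)
    run_experiment(dataset, config)  # hypothetical helper wrapping the dataset run loop above

    return {"status": "accepted"}

In practice you would likely start the run in a background task or job queue rather than inside the request handler, since a full dataset run can take a while.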