The ABV Python SDK provides ways to evaluate your application. You can add custom scores to your traces and observations, or use the SDK to execute Dataset Runs.
This page shows the evaluation methods that are supported by the Python SDK. Please refer to the Evaluation Overview for more information on how to evaluate your application in ABV.

Scoring traces and observations

Install package
pip install abvdev

ABVSpan / ABVGeneration object methods

  • span_or_generation_obj.score(): Scores the specific observation object.
  • span_or_generation_obj.score_trace(): Scores the entire trace to which the object belongs.
from abvdev import ABV

# ABV client initialization          
abv = ABV(
    api_key="sk-abv-...", # your api key here
    host="https://app.abv.dev", # host="https://eu.app.abv.dev", for EU region
)

with abv.start_as_current_observation(as_type='generation', name="summary_generation") as gen:
    # ... LLM call ...
    gen.update(output="summary text...")
    # Score this specific generation
    gen.score(name="conciseness", value=0.8, data_type="NUMERIC")
    # Score the overall trace
    gen.score_trace(name="user_feedback_rating", value="positive", data_type="CATEGORICAL")

Context-aware methods

  • abv.score_current_span(): Scores the currently active observation in the context.
  • abv.score_current_trace(): Scores the trace of the currently active observation.
from abvdev import ABV

# ABV client initialization          
abv = ABV(
    api_key="sk-abv-...", # your api key here
    host="https://app.abv.dev", # host="https://eu.app.abv.dev", for EU region
)

# Simulate a fully successful task completion
task_is_fully_successful = True

with abv.start_as_current_span(name="complex_task") as task_span:
    # ... perform task ...
    abv.score_current_span(name="task_component_quality", value=True, data_type="BOOLEAN")
    # ...
    if task_is_fully_successful:
         abv.score_current_trace(name="overall_success", value=1.0, data_type="NUMERIC")

Low-level method

  • Creates a score for a specified trace_id and optionally observation_id.
  • Useful when IDs are known, or for scoring after the trace/observation has completed.
from abvdev import ABV

# ABV client initialization          
abv = ABV(
    api_key="sk-abv-...", # your api key here
    host="https://app.abv.dev", # host="https://eu.app.abv.dev", for EU region
)

abv.create_score(
    name="fact_check_accuracy",
    value=0.95, # Can be float for NUMERIC/BOOLEAN, string for CATEGORICAL
    trace_id="your_trace_id",
    observation_id="your_observation_id", # Optional: if scoring a specific observation
    session_id="your_session_id", # Optional: if scoring a specific session   
    data_type="NUMERIC", # "NUMERIC", "BOOLEAN", "CATEGORICAL"
    comment="Source verified for 95% of claims."
)

Score Parameters:

| Parameter | Type | Description |
|---|---|---|
| name | str | Name of the score (e.g., “relevance”, “accuracy”). Required. |
| value | Union[float, str] | Score value. Float for NUMERIC/BOOLEAN, string for CATEGORICAL. Required. |
| trace_id | str | ID of the trace to associate with (for create_score). Required. |
| session_id | Optional[str] | ID of the specific session to score (for create_score). |
| observation_id | Optional[str] | ID of the specific observation to score (for create_score). |
| score_id | Optional[str] | Custom ID for the score (auto-generated if None). |
| data_type | Optional[ScoreDataType] | "NUMERIC", "BOOLEAN", or "CATEGORICAL". Inferred from the value type and the score config on the server if not provided. |
| comment | Optional[str] | Optional comment or explanation for the score. |
| config_id | Optional[str] | Optional ID of a pre-defined score configuration in ABV. |
See Scoring for more details.
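
As an illustration of how these parameters combine, the following sketch creates a CATEGORICAL score that references a pre-defined score configuration. The configuration ID, score ID, and trace ID are placeholder values; a score config with matching categories is assumed to exist in your ABV project.

from abvdev import ABV

# ABV client initialization
abv = ABV(
    api_key="sk-abv-...", # your api key here
    host="https://app.abv.dev", # host="https://eu.app.abv.dev", for EU region
)

abv.create_score(
    name="answer_quality",
    value="good", # string value for a CATEGORICAL score
    trace_id="your_trace_id",
    data_type="CATEGORICAL",
    score_id="your_custom_score_id", # placeholder: omit to auto-generate an ID
    config_id="your_score_config_id", # placeholder: ID of a pre-defined score configuration
    comment="Answer matched the reference and was well structured."
)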

Datasets

ABV Datasets are essential for evaluating and testing your LLM applications by allowing you to manage collections of inputs and their expected outputs.

Interacting with Datasets

  • Creating: You can programmatically create new datasets with abv.create_dataset(...) and add items to them using abv.create_dataset_item(...).
  • Fetching: Retrieve a dataset and its items using abv.get_dataset(name: str). This returns a DatasetClient instance, which contains a list of DatasetItemClient objects (accessible via dataset.items). Each DatasetItemClient holds the input, expected_output, and metadata for an individual data point.
from abvdev import ABV

# ABV client initialization          
abv = ABV(
    api_key="sk-abv-...", # your api key here
    host="https://app.abv.dev", # host="https://eu.app.abv.dev", for EU region
)

# Fetch an existing dataset
dataset = abv.get_dataset(name="my-eval-dataset")
for item in dataset.items:
    print(f"Input: {item.input}, Expected: {item.expected_output}")

# Briefly: Creating a dataset and an item
new_dataset = abv.create_dataset(name="new-summarization-tasks")
abv.create_dataset_item(
    dataset_name="new-summarization-tasks",
    input={"text": "Long article..."},
    expected_output={"summary": "Short summary."}
)

Linking Traces to Dataset Items for Runs

The most powerful way to use datasets is by linking your application’s executions (traces) to specific dataset items when performing an evaluation run. See the Datasets documentation for more details. The DatasetItemClient.run() method provides a context manager that streamlines this process.

How item.run() works: when you enter with item.run(run_name="your_eval_run_name") as root_span:, the SDK does the following:
  1. Trace Creation: A new ABV trace is initiated specifically for processing this dataset item within the context of the named run.
  2. Trace Naming & Metadata:
    • The trace is automatically named (e.g., “Dataset run: your_eval_run_name”).
    • Essential metadata is added to this trace, including dataset_item_id (the ID of item), run_name, and dataset_id.
  3. DatasetRunItem Linking: The SDK makes an API call to ABV to create a DatasetRunItem. This backend object formally links:
    • The dataset_item_id
    • The trace_id of the newly created trace
    • The provided run_name
    • Any run_metadata or run_description you pass to item.run().
    This linkage is what populates the “Runs” tab for your dataset in the ABV UI, allowing you to see all traces associated with a particular evaluation run.
  4. Contextual Span: The context manager yields root_span, which is an ABVSpan object representing the root span of this new trace.
  5. Automatic Nesting: Any ABV observations (spans or generations) created inside the with block will automatically become children of root_span and thus part of the trace linked to this dataset item and run.
Example:
from abvdev import ABV

# ABV client initialization          
abv = ABV(
    api_key="sk-abv-...", # your api key here
    host="https://app.abv.dev", # host="https://eu.app.abv.dev", for EU region
)

dataset_name = "qna-eval"
current_run_name = "qna_model_v3_run_05_20" # Identifies this specific evaluation run

# Assume 'my_qna_app' is your instrumented application function
def my_qna_app(question: str, context: str, item_id: str, run_name: str):
    with abv.start_as_current_observation(
        as_type='generation',
        name="qna-llm-call",
        input={"question": question, "context": context},
        metadata={"item_id": item_id, "run": run_name}, # Example metadata for the generation
        model="gpt-5-2025-08-07"
    ) as generation:
        # Simulate LLM call
        answer = f"Answer to '{question}' using context." # Replace with actual LLM call
        generation.update(output={"answer": answer})

        # Update the trace with the input and output
        generation.update_trace(
            input={"question": question, "context": context},
            output={"answer": answer},
        )

        return answer

dataset = abv.get_dataset(name=dataset_name) # Fetch your pre-populated dataset

for item in dataset.items:
    print(f"Running evaluation for item: {item.id} (Input: {item.input})")

    # Use the item.run() context manager
    with item.run(
        run_name=current_run_name,
        run_metadata={"model_provider": "OpenAI", "temperature_setting": 0.7},
        run_description="Evaluation run for Q&A model v3 on May 20th"
    ) as root_span: # root_span is the root span of the new trace for this item and run.
        # All subsequent abv operations within this block are part of this trace.

        # Call your application logic
        generated_answer = my_qna_app(
            question=item.input["question"],
            context=item.input["context"],
            item_id=item.id,
            run_name=current_run_name
        )

        print(f"  Item {item.id} processed. Trace ID: {root_span.trace_id}")

        # Optionally, score the result against the expected output
        if item.expected_output and generated_answer == item.expected_output.get("answer"):
            root_span.score_trace(name="exact_match", value=1.0)
        else:
            root_span.score_trace(name="exact_match", value=0.0)

print(f"\nFinished processing dataset '{dataset_name}' for run '{current_run_name}'.")

By using item.run(), you ensure each dataset item’s processing is neatly encapsulated in its own trace, and these traces are aggregated under the specified run_name in the ABV UI. This allows for systematic review of results, comparison across runs, and deep dives into individual processing traces.
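
To compare model or prompt variants, execute the same dataset under a different run_name; each run then appears alongside the others in the dataset’s “Runs” tab. A minimal sketch, reusing the dataset and my_qna_app function from the example above (the run name below is a placeholder):

comparison_run_name = "qna_model_v4_run_05_27" # placeholder: identifies the second evaluation run

for item in dataset.items:
    with item.run(
        run_name=comparison_run_name,
        run_description="Evaluation run for Q&A model v4"
    ) as root_span:
        generated_answer = my_qna_app(
            question=item.input["question"],
            context=item.input["context"],
            item_id=item.id,
            run_name=comparison_run_name
        )
        # Score each trace so both runs can be compared on the same metric
        is_match = bool(item.expected_output) and generated_answer == item.expected_output.get("answer")
        root_span.score_trace(name="exact_match", value=1.0 if is_match else 0.0)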