This page shows the evaluation methods that are supported by the Python SDK. Please refer to the Evaluation Overview for more information on how to evaluate your application in ABV.
Scoring traces and observations
`ABVSpan` / `ABVGeneration` object methods
- `span_or_generation_obj.score()`: Scores the specific observation object.
- `span_or_generation_obj.score_trace()`: Scores the entire trace to which the object belongs.
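A minimal sketch of the object methods is shown below. The import path, client constructor, and the `start_as_current_span()` helper are assumptions used only to obtain a span object; the `score()` and `score_trace()` calls follow the methods listed above and the parameter table further down.

```python
from abv import ABV  # assumed import path for the ABV client

abv = ABV()  # assumed constructor; credentials/configuration omitted

with abv.start_as_current_span(name="summarize-request") as span:  # assumed span API
    summary = "A short summary produced by your application."

    # Score this specific observation (the span itself).
    span.score(name="conciseness", value=0.8, data_type="NUMERIC")

    # Score the entire trace this span belongs to.
    span.score_trace(name="user_feedback", value="positive", data_type="CATEGORICAL")
```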
Context-aware methods
- `abv.score_current_span()`: Scores the currently active observation in the context.
- `abv.score_current_trace()`: Scores the trace of the currently active observation.
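The context-aware methods act on whatever observation is active in the current context, so no object reference is needed. In the sketch below, the client and the span-creation call are the same assumptions as in the previous example.

```python
from abv import ABV  # assumed import path

abv = ABV()  # assumed constructor

with abv.start_as_current_span(name="rag-pipeline"):  # assumed API, establishes context
    # ... retrieval and generation happen here ...

    # Scores the currently active observation.
    abv.score_current_span(name="relevance", value=0.9)

    # Scores the trace that the active observation belongs to.
    abv.score_current_trace(
        name="helpfulness",
        value=1.0,
        comment="Answered the user's question directly.",
    )
```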
Low-level method
- `create_score()`: Creates a score for a specified `trace_id` and, optionally, an `observation_id`.
- Useful when IDs are known, or for scoring after the trace/observation has completed.
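A sketch of the low-level path follows. The call is shown as `abv.create_score()` based on the parameter table below; the IDs are placeholders and the client setup is an assumption.

```python
from abv import ABV  # assumed import path

abv = ABV()  # assumed constructor

abv.create_score(
    name="accuracy",
    value=0.75,
    trace_id="abc123",        # placeholder trace ID
    observation_id="def456",  # optional, placeholder observation ID
    data_type="NUMERIC",
    comment="Scored offline by a nightly evaluation job.",
)
```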
Score Parameters:
| Parameter | Type | Description |
|---|---|---|
| `name` | `str` | Name of the score (e.g., "relevance", "accuracy"). Required. |
| `value` | `Union[float, str]` | Score value. Float for `NUMERIC`/`BOOLEAN`, string for `CATEGORICAL`. Required. |
| `trace_id` | `str` | ID of the trace to associate with (for `create_score`). Required. |
| `session_id` | `Optional[str]` | ID of the specific session to score (for `create_score`). |
| `observation_id` | `Optional[str]` | ID of the specific observation to score (for `create_score`). |
| `score_id` | `Optional[str]` | Custom ID for the score (auto-generated if `None`). |
| `data_type` | `Optional[ScoreDataType]` | `"NUMERIC"`, `"BOOLEAN"`, or `"CATEGORICAL"`. Inferred from the value type and the score config on the server if not provided. |
| `comment` | `Optional[str]` | Optional comment or explanation for the score. |
| `config_id` | `Optional[str]` | Optional ID of a pre-defined score configuration in ABV. |
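To make the `CATEGORICAL` and `config_id` parameters concrete, here is a hedged example, assuming `abv` is the initialized client from the earlier sketches; the trace and config IDs are placeholders.

```python
abv.create_score(
    name="hallucination",
    value="none",                      # string value for CATEGORICAL scores
    trace_id="abc123",                 # placeholder trace ID
    data_type="CATEGORICAL",
    config_id="cfg-hallucination-v2",  # placeholder ID of a score config in ABV
    comment="No unsupported claims found in the answer.",
)
```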
Datasets
ABV Datasets are essential for evaluating and testing your LLM applications by allowing you to manage collections of inputs and their expected outputs.

Interacting with Datasets
- Creating: You can programmatically create new datasets with `abv.create_dataset(...)` and add items to them using `abv.create_dataset_item(...)`.
- Fetching: Retrieve a dataset and its items using `abv.get_dataset(name: str)`. This returns a `DatasetClient` instance, which contains a list of `DatasetItemClient` objects (accessible via `dataset.items`). Each `DatasetItemClient` holds the `input`, `expected_output`, and `metadata` for an individual data point.
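A minimal sketch of creating, populating, and fetching a dataset follows. The keyword arguments (e.g., `dataset_name` on `create_dataset_item`) and the client setup are assumptions; check the SDK reference for the exact signatures.

```python
from abv import ABV  # assumed import path

abv = ABV()  # assumed constructor

# Create a dataset and add a single item to it.
abv.create_dataset(name="capital-cities")
abv.create_dataset_item(
    dataset_name="capital-cities",  # assumed parameter name
    input={"question": "What is the capital of France?"},
    expected_output={"answer": "Paris"},
    metadata={"split": "smoke-test"},
)

# Fetch the dataset and iterate over its items.
dataset = abv.get_dataset("capital-cities")  # returns a DatasetClient
for item in dataset.items:                   # DatasetItemClient objects
    print(item.input, item.expected_output, item.metadata)
```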
Linking Traces to Dataset Items for Runs
The most powerful way to use datasets is by linking your application’s executions (traces) to specific dataset items when performing an evaluation run. See the Datasets documentation for more details. The `DatasetItemClient.run()` method provides a context manager to streamline this process.
How `item.run()` works:
When you use `with item.run(run_name="your_eval_run_name") as root_span:`:
- Trace Creation: A new ABV trace is initiated specifically for processing this dataset item within the context of the named run.
- Trace Naming & Metadata:
  - The trace is automatically named (e.g., “Dataset run: your_eval_run_name”).
  - Essential metadata is added to this trace, including `dataset_item_id` (the ID of `item`), `run_name`, and `dataset_id`.
- DatasetRunItem Linking: The SDK makes an API call to ABV to create a `DatasetRunItem`. This backend object formally links:
  - The `dataset_item_id`
  - The `trace_id` of the newly created trace
  - The provided `run_name`
  - Any `run_metadata` or `run_description` you pass to `item.run()`.

  This linkage is what populates the “Runs” tab for your dataset in the ABV UI, allowing you to see all traces associated with a particular evaluation run.
- Contextual Span: The context manager yields `root_span`, which is an `ABVSpan` object representing the root span of this new trace.
- Automatic Nesting: Any ABV observations (spans or generations) created inside the `with` block will automatically become children of `root_span` and thus part of the trace linked to this dataset item and run.
By using `item.run()`, you ensure each dataset item’s processing is neatly encapsulated in its own trace, and these traces are aggregated under the specified `run_name` in the ABV UI. This allows for systematic review of results, comparison across runs, and deep dives into individual processing traces.
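Putting this together, here is a sketch of an evaluation run driven by `item.run()`. The application function, dataset name, and run metadata values are placeholders, and the client setup is an assumption; the `run()` context manager and the scoring call follow the steps described above.

```python
from abv import ABV  # assumed import path

abv = ABV()  # assumed constructor

def my_app(question: str) -> str:
    """Placeholder for your LLM application."""
    return "Paris"

dataset = abv.get_dataset("capital-cities")  # placeholder dataset name

for item in dataset.items:
    with item.run(
        run_name="baseline-prompt-v1",                   # groups traces into one run
        run_description="Baseline prompt, temperature 0",  # optional
        run_metadata={"model": "gpt-4o"},                   # optional
    ) as root_span:
        # Observations created here are nested under root_span and therefore
        # linked to this dataset item and run.
        output = my_app(item.input["question"])

        # Score the trace for this item, e.g. exact match against the target.
        root_span.score_trace(
            name="exact_match",
            value=1.0 if output == item.expected_output["answer"] else 0.0,
        )
```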