# Python SDK - Evaluations
The abv Python SDK provides ways to evaluate your application. You can add custom scores to your traces and observations, or use the SDK to execute dataset runs. This page shows the evaluation methods supported by the Python SDK. Please refer to the Evaluation Overview documentation for more information on how to evaluate your application in abv.

## Scoring Traces and Observations

**`AbvSpan` / `AbvGeneration` object methods:**

- `span_or_generation_obj.score()`: Scores the specific observation object (span or generation).
- `span_or_generation_obj.score_trace()`: Scores the entire trace to which the object belongs.

```python
from abvdev import get_client

abv = get_client()

with abv.start_as_current_generation(name="summary_generation") as gen:
    # ... LLM call ...
    gen.update(output="summary text...")

    # Score this specific generation
    gen.score(name="conciseness", value=0.8, data_type="NUMERIC")

    # Score the overall trace
    gen.score_trace(name="user_feedback_rating", value="positive", data_type="CATEGORICAL")
```

**Context-aware methods:**

- `abv.score_current_span()`: Scores the currently active observation in the context.
- `abv.score_current_trace()`: Scores the trace of the currently active observation.

```python
from abvdev import get_client

abv = get_client()

with abv.start_as_current_span(name="complex_task") as task_span:
    # ... perform task ...
    abv.score_current_span(name="task_component_quality", value=True, data_type="BOOLEAN")

    # If the task is fully successful
    abv.score_current_trace(name="overall_success", value=1.0, data_type="NUMERIC")
```

**Low-level method: `abv.create_score()`**

Creates a score for a specified `trace_id` and, optionally, an `observation_id`. This is useful when the IDs are already known, or for scoring after the trace/observation has completed.

```python
from abvdev import get_client

abv = get_client()

abv.create_score(
    name="fact_check_accuracy",
    value=0.95,  # Can be a float for NUMERIC/BOOLEAN, or a string for CATEGORICAL
    trace_id="abcdef1234567890abcdef1234567890",
    observation_id="1234567890abcdef",  # Optional: if scoring a specific observation
    session_id="session_123",  # Optional: if scoring a specific session
    data_type="NUMERIC",  # "NUMERIC", "BOOLEAN", or "CATEGORICAL"
    comment="Source verified for 95% of claims."
)
```

**Score parameters:**

| Parameter | Type | Description |
| --- | --- | --- |
| `name` | `str` | Name of the score (e.g., "relevance", "accuracy"). Required. |
| `value` | `Union[float, str]` | Score value: a float for `NUMERIC`/`BOOLEAN`, a string for `CATEGORICAL`. Required. |
| `trace_id` | `str` | ID of the trace to associate the score with (for `create_score`). Required. |
| `session_id` | `Optional[str]` | ID of the specific session to score (for `create_score`). |
| `observation_id` | `Optional[str]` | ID of the specific observation to score (for `create_score`). |
| `score_id` | `Optional[str]` | Custom ID for the score (auto-generated if `None`). |
| `data_type` | `Optional[ScoreDataType]` | `"NUMERIC"`, `"BOOLEAN"`, or `"CATEGORICAL"`. If not provided, it is inferred from the value type and the score config on the server. |
| `comment` | `Optional[str]` | Optional comment or explanation for the score. |
| `config_id` | `Optional[str]` | Optional ID of a pre-defined score configuration in abv. |

See [Scoring](https://docs.abv.dev/evaluation-overview) for more details.
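As a minimal sketch of how several of these parameters combine, the snippet below records a categorical score against an already-completed trace. The trace ID, score ID, and config ID are placeholders, and it assumes a pre-defined score configuration (for example, one restricting the allowed categories) already exists in your abv project.

```python
from abvdev import get_client

abv = get_client()

# Placeholder IDs: substitute a real trace ID and score config ID from your project.
abv.create_score(
    name="answer_tone",
    value="friendly",                     # String value, so the score is CATEGORICAL
    trace_id="abcdef1234567890abcdef1234567890",
    data_type="CATEGORICAL",
    config_id="cfg_answer_tone",          # Hypothetical ID of a pre-defined score config
    score_id="tone-review-001",           # Custom score ID instead of an auto-generated one
    comment="Labelled by a human reviewer after the trace completed.",
)
```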
## Datasets

[abv datasets](https://docs.abv.dev/datasets) are essential for evaluating and testing your LLM applications: they allow you to manage collections of inputs and their expected outputs.

**Interacting with datasets:**

- **Creating:** You can programmatically create new datasets with `abv.create_dataset(...)` and add items to them using `abv.create_dataset_item(...)`.
- **Fetching:** Retrieve a dataset and its items using `abv.get_dataset(name: str)`. This returns a `DatasetClient` instance, which contains a list of `DatasetItemClient` objects (accessible via `dataset.items`). Each `DatasetItemClient` holds the `input`, `expected_output`, and `metadata` for an individual data point.

```python
from abvdev import get_client

abv = get_client()

# Fetch an existing dataset
dataset = abv.get_dataset(name="my_eval_dataset")
for item in dataset.items:
    print(f"Input: {item.input}, Expected: {item.expected_output}")

# Briefly: creating a dataset and an item
new_dataset = abv.create_dataset(name="new_summarization_tasks")
abv.create_dataset_item(
    dataset_name="new_summarization_tasks",
    input={"text": "Long article..."},
    expected_output={"summary": "Short summary..."}
)
```
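When populating a dataset for the first time, the two calls above are typically combined in a loop. The sketch below shows one way the `qna_eval` dataset used in the example further down could be seeded; the question/answer records are hypothetical and would usually be loaded from a local file.

```python
from abvdev import get_client

abv = get_client()

# Hypothetical evaluation data; in practice this might be loaded from JSON or CSV.
qa_records = [
    {
        "question": "What is the capital of France?",
        "context": "France is a country in Western Europe. Its capital is Paris.",
        "answer": "Paris",
    },
    {
        "question": "Who wrote 'Dune'?",
        "context": "'Dune' is a 1965 science fiction novel by Frank Herbert.",
        "answer": "Frank Herbert",
    },
]

abv.create_dataset(name="qna_eval")  # Create the dataset once

for record in qa_records:
    # Each record becomes one dataset item with an input and an expected output
    abv.create_dataset_item(
        dataset_name="qna_eval",
        input={"question": record["question"], "context": record["context"]},
        expected_output={"answer": record["answer"]},
    )
```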
### Linking Traces to Dataset Items for Runs

The most powerful way to use datasets is to link your application's executions (traces) to specific dataset items when performing an evaluation run. See the [datasets documentation](https://docs.abv.dev/datasets) for more details. The `DatasetItemClient.run()` method provides a context manager that streamlines this process.

**How `item.run()` works**

When you use `with item.run(run_name="your_eval_run_name") as root_span:`:

1. **Trace creation:** A new abv trace is initiated specifically for processing this dataset item within the context of the named run.
2. **Trace naming & metadata:** The trace is automatically named (e.g., "Dataset run: your_eval_run_name"), and essential metadata is added to it, including `dataset_item_id` (the ID of `item`), `run_name`, and `dataset_id`.
3. **`DatasetRunItem` linking:** The SDK makes an API call to abv to create a `DatasetRunItem`. This backend object formally links the dataset item ID, the trace ID of the newly created trace, the provided `run_name`, and any `run_metadata` or `run_description` you pass to `item.run()`. This linkage is what populates the "Runs" tab for your dataset in the abv UI, allowing you to see all traces associated with a particular evaluation run.
4. **Contextual span:** The context manager yields `root_span`, an `AbvSpan` object representing the root span of this new trace.
5. **Automatic nesting:** Any abv observations (spans or generations) created inside the `with` block automatically become children of `root_span` and are thus part of the trace linked to this dataset item and run.

**Example**

```python
from abvdev import get_client

abv = get_client()

dataset_name = "qna_eval"
current_run_name = "qna_model_v3_run_05_20"  # Identifies this specific evaluation run

# Assume 'my_qna_app' is your instrumented application function
def my_qna_app(question: str, context: str, item_id: str, run_name: str):
    with abv.start_as_current_generation(
        name="qna_llm_call",
        input={"question": question, "context": context},
        metadata={"item_id": item_id, "run": run_name},  # Example metadata for the generation
        model="gpt-5-2025-08-07"
    ) as generation:
        # Simulate an LLM call
        answer = f"Answer to '{question}' using context."  # Replace with an actual LLM call
        generation.update(output={"answer": answer})

        # Update the trace with the input and output
        generation.update_trace(
            input={"question": question, "context": context},
            output={"answer": answer},
        )

        return answer

dataset = abv.get_dataset(name=dataset_name)  # Fetch your pre-populated dataset

for item in dataset.items:
    print(f"Running evaluation for item: {item.id} (Input: {item.input})")

    # Use the item.run() context manager
    with item.run(
        run_name=current_run_name,
        run_metadata={"model_provider": "OpenAI", "temperature_setting": 0.7},
        run_description="Evaluation run for Q&A model v3 on May 20th"
    ) as root_span:  # root_span is the root span of the new trace for this item and run.
        # All subsequent abv operations within this block are part of this trace.

        # Call your application logic
        generated_answer = my_qna_app(
            question=item.input["question"],
            context=item.input["context"],
            item_id=item.id,
            run_name=current_run_name
        )

        print(f"  Item {item.id} processed. Trace ID: {root_span.trace_id}")

        # Optionally, score the result against the expected output
        if item.expected_output and generated_answer == item.expected_output.get("answer"):
            root_span.score_trace(name="exact_match", value=1.0)
        else:
            root_span.score_trace(name="exact_match", value=0.0)

print(f"\nFinished processing dataset '{dataset_name}' for run '{current_run_name}'.")
```

By using `item.run()`, you ensure that each dataset item's processing is neatly encapsulated in its own trace, and that these traces are aggregated under the specified `run_name` in the abv UI. This allows for systematic review of results, comparison across runs, and deep dives into individual processing traces.
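Since the `run_name` is what groups traces into a run, one way to compare configurations is to wrap the loop above in a small helper and call it once per run. The sketch below is illustrative only: the helper, the placeholder answer functions, and the run names are hypothetical, and it reuses only the `item.run()` and `score_trace()` calls shown in this section.

```python
from typing import Callable, Dict

from abvdev import get_client

abv = get_client()

def evaluate_dataset(dataset_name: str, run_name: str, answer_fn: Callable[[Dict], str]) -> None:
    """Run answer_fn over every item in a dataset and record scores under run_name."""
    dataset = abv.get_dataset(name=dataset_name)
    for item in dataset.items:
        with item.run(run_name=run_name) as root_span:
            generated = answer_fn(item.input)  # Your application logic for one item
            expected = (item.expected_output or {}).get("answer")
            root_span.score_trace(name="exact_match", value=1.0 if generated == expected else 0.0)

# Placeholder answer functions standing in for two configurations of your application
def answer_v3(item_input: Dict) -> str:
    return f"v3 answer to '{item_input['question']}'"

def answer_v4(item_input: Dict) -> str:
    return f"v4 answer to '{item_input['question']}'"

evaluate_dataset("qna_eval", "qna_model_v3_baseline", answer_v3)
evaluate_dataset("qna_eval", "qna_model_v4_candidate", answer_v4)
```

Both runs then appear side by side in the "Runs" tab of the dataset in the abv UI, which is where the cross-run comparison described above takes place.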