Custom Scores are the most flexible way to implement evaluation workflows in ABV. As with any other evaluation method, the purpose of custom scores is to assign evaluation metrics to Traces, Observations, Sessions, or DatasetRuns via the Score object (see Scores Data Model). This is achieved by ingesting scores via the ABV SDKs or API.

Common Use Cases

  • Collecting user feedback: Capture in-app feedback from users on application quality or performance via the Browser SDK.
  • Custom evaluation data pipeline: Continuously monitor quality by fetching traces from ABV, running custom evaluations, and ingesting scores back.
  • Custom internal workflow tooling: Build custom internal tooling that helps you manage human-in-the-loop workflows. Ingest scores back into ABV, optionally following your custom schema by referencing a config.
  • Custom run-time evaluations: e.g. track whether the generated SQL code actually worked, or if the structured output was valid JSON.

Ingesting Scores via API/SDKs

You can add scores via the ABV SDKs or API. Scores can take one of three data types: Numeric, Categorical, or Boolean. If a score is ingested manually using a trace_id to link it to a trace, you do not need to wait until the trace has been created: the score will show up in the scores table and will be linked to the trace once a trace with the same trace_id is created. Below are examples for each score data type.

Python SDK

Install package
pip install abvdev
Numeric

Numeric score values must be provided as a float.
from abvdev import ABV

abv = ABV(
    api_key="sk-abv-...", # your api key here
    host="https://app.abv.dev", # host="https://eu.app.abv.dev", for EU region
)

# Method 1: Score via low-level method
abv.create_score(
    name="correctness",
    value=0.9,
    trace_id="trace_id_here",
    observation_id="observation_id_here", # optional
    
    data_type="NUMERIC", # optional, inferred if not provided
    comment="Factually correct", # optional
)

# Method 2: Score current span/generation (within context)
with abv.start_as_current_span(name="my-operation") as span:
    # Score the current span
    span.score(
        name="correctness",
        value=0.9,
        data_type="NUMERIC",
        comment="Factually correct"
    )

    # Score the trace
    span.score_trace(
        name="overall_quality",
        value=0.95,
        data_type="NUMERIC"
    )


# Method 3: Score via the current context
with abv.start_as_current_span(name="my-operation"):
    # Score the current span
    abv.score_current_span(
        name="correctness",
        value=0.9,
        data_type="NUMERIC",
        comment="Factually correct"
    )

    # Score the trace
    abv.score_current_trace(
        name="overall_quality",
        value=0.95,
        data_type="NUMERIC"
    )

Categorical

Categorical score values must be provided as strings.
from abvdev import ABV

abv = ABV(
    api_key="sk-abv-...", # your api key here
    host="https://app.abv.dev", # host="https://eu.app.abv.dev", for EU region
)

# Method 1: Score via low-level method
abv.create_score(
    name="accuracy",
    value="partially correct",
    trace_id="trace_id_here",
    observation_id="observation_id_here", # optional
    data_type="CATEGORICAL", # optional, inferred if not provided
    comment="Some factual errors", # optional
)

# Method 2: Score current span/generation (within context)
with abv.start_as_current_span(name="my-operation") as span:
    # Score the current span
    span.score(
        name="accuracy",
        value="partially correct",
        data_type="CATEGORICAL",
        comment="Some factual errors"
    )

    # Score the trace
    span.score_trace(
        name="overall_quality",
        value="partially correct",
        data_type="CATEGORICAL"
    )

# Method 3: Score via the current context
with abv.start_as_current_span(name="my-operation"):
    # Score the current span
    abv.score_current_span(
        name="accuracy",
        value="partially correct",
        data_type="CATEGORICAL",
        comment="Some factual errors"
    )

    # Score the trace
    abv.score_current_trace(
        name="overall_quality",
        value="partially correct",
        data_type="CATEGORICAL"
    )

Boolean

Boolean scores must be provided as a float. The value’s string equivalent will be automatically populated and is accessible on read.
from abvdev import ABV

abv = ABV(
    api_key="sk-abv-...", # your api key here
    host="https://app.abv.dev", # host="https://eu.app.abv.dev", for EU region
)

# Method 1: Score via low-level method
abv.create_score(
    name="helpfulness",
    value=0, # 0 or 1
    trace_id="trace_id_here",
    observation_id="observation_id_here", # optional
    data_type="BOOLEAN", # required, numeric values 
    #without data type would be inferred as NUMERIC
    comment="Incorrect answer", # optional
)

# Method 2: Score current span/generation (within context)
with abv.start_as_current_span(name="my-operation") as span:
    # Score the current span
    span.score(
        name="helpfulness",
        value=1, # 0 or 1
        data_type="BOOLEAN",
        comment="Very helpful response"
    )

    # Score the trace
    span.score_trace(
        name="overall_quality",
        value=1, # 0 or 1
        data_type="BOOLEAN"
    )

# Method 3: Score via the current context
with abv.start_as_current_span(name="my-operation"):
    # Score the current span
    abv.score_current_span(
        name="helpfulness",
        value=1, # 0 or 1
        data_type="BOOLEAN",
        comment="Very helpful response"
    )

    # Score the trace
    abv.score_current_trace(
        name="overall_quality",
        value=1, # 0 or 1
        data_type="BOOLEAN"
    )

JS/TS SDK

npm i @abvdev/client
Environment variables
Add your ABV credentials as environment variables, e.g. use a .env file and the dotenv package to load the variable values.
npm install dotenv
.env
ABV_API_KEY="sk-abv-..."
ABV_BASEURL="https://app.abv.dev" # US region
# ABV_BASEURL="https://eu.app.abv.dev" # EU region
import { ABVClient } from "@abvdev/client";
 
const abv = new ABVClient();
Alternatively, use constructor parameters:
import { ABVClient } from "@abvdev/client";
 
const abv = new ABVClient({
  apiKey: "sk-abv-...",
  baseUrl: "https://app.abv.dev", // US region
  // baseUrl: "https://eu.app.abv.dev", // EU region
});

Numeric

Numeric score values must be provided as a float.
import { ABVClient } from "@abvdev/client";
import dotenv from "dotenv";
dotenv.config();
 
const abv = new ABVClient();
 
abv.score.create({
  id: "unique_id", // optional, can be used as an idempotency
  // key to update the score subsequently
  traceId: "target_trace_id_here",
  observationId: "target_observation_id_here", // optional
  name: "correctness",
  value: 0.9,
  dataType: "NUMERIC", // optional, inferred if not provided
  comment: "Factually correct", // optional
});
 
async function main() {
  // Flush the scores in short-lived environments
  await abv.flush();
}

main();

Categorical

Categorical score values must be provided as strings.
import { ABVClient } from "@abvdev/client";
import dotenv from "dotenv";
dotenv.config();
 
const abv = new ABVClient();
 
abv.score.create({
  id: "unique_id", // optional, can be used 
  // as an idempotency key to update the score subsequently
  traceId: "target_trace_id_here",
  observationId: "target_observation_id_here", // optional
  name: "accuracy",
  value: "partially correct",
  dataType: "CATEGORICAL", // optional, inferred if not provided
  comment: "Factually correct", // optional
});
 
async function main() {
  // Flush the scores in short-lived environments
  await abv.flush();
}

main();

Boolean

Boolean scores must be provided as a float. The value’s string equivalent will be automatically populated and is accessible on read. See API reference for more details on POST/GET scores endpoints.
import { ABVClient } from "@abvdev/client";
import dotenv from "dotenv";
dotenv.config();

const abv = new ABVClient();
 
abv.score.create({
  id: "unique_id", // optional, can be used as an 
  // idempotency key to update the score subsequently
  traceId: "target_trace_id_here",
  observationId: "target_observation_id_here", // optional
  name: "helpfulness",
  value: 0, // 0 or 1
  dataType: "BOOLEAN", // required, numeric values without
  // data type would be inferred as NUMERIC
  comment: "Incorrect answer", // optional
});
 
async function main() {
  // Flush the scores in short-lived environments
  await abv.flush();
}

main();
→ More details in the Python SDK docs and JS/TS SDK docs. See the API reference for more details on the POST/GET scores endpoints.

Preventing Duplicate Scores

By default, ABV allows multiple scores with the same name on the same trace. This is useful if you’d like to track the evolution of a score over time or if, for example, you’ve received multiple user feedback scores on the same trace. In some cases, you want to prevent this behavior or update an existing score instead. This can be achieved by constructing an idempotency key for the score, e.g. <trace_id>-<score_name>, and passing it as the id when creating the score.
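For illustration, a minimal Python sketch of this pattern; the trace id and score name are hypothetical placeholders, and the deterministic score_id means re-ingesting the score later updates the earlier one instead of adding a duplicate:

from abvdev import ABV

abv = ABV(
    api_key="sk-abv-...", # your api key here
    host="https://app.abv.dev", # host="https://eu.app.abv.dev", for EU region
)

trace_id = "trace_id_here" # hypothetical trace id
score_name = "user_feedback"

abv.create_score(
    name=score_name,
    value=1,
    trace_id=trace_id,
    score_id=f"{trace_id}-{score_name}", # idempotency key: <trace_id>-<score_name>
)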

Enforcing a Score Config

Score configs are helpful when you want to standardize your scores for future analysis. To enforce a score config, provide a configId when creating a score to reference a ScoreConfig that was previously created. Score Configs can be defined in the ABV UI or via our API. Whenever you provide a ScoreConfig, the score data will be validated against the config. The following rules apply (see the sketch after this list for an illustration):
  • Score Name: Must equal the config’s name
  • Score Data Type: When provided, must match the config’s data type
  • Score Value when Type is numeric: Value must be within the min and max values defined in the config (min and max are optional; if not provided, they default to -∞ and +∞ respectively)
  • Score Value when Type is categorical: Value must map to one of the categories defined in the config
  • Score Value when Type is boolean: Value must equal 0 or 1
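
For illustration, a minimal Python sketch of these rules, assuming a hypothetical categorical ScoreConfig named "correctness" with the categories "correct", "partially correct", and "incorrect"; the config id and trace id are placeholders:

from abvdev import ABV

abv = ABV(
    api_key="sk-abv-...", # your api key here
    host="https://app.abv.dev", # host="https://eu.app.abv.dev", for EU region
)

# Passes validation: the score name equals the config name and the
# value maps to one of the configured categories.
abv.create_score(
    name="correctness",
    value="partially correct",
    trace_id="trace_id_here",
    config_id="config_id_here", # hypothetical config id
)

# Fails validation: "somewhat correct" is not a category defined in the config.
abv.create_score(
    name="correctness",
    value="somewhat correct",
    trace_id="trace_id_here",
    config_id="config_id_here",
)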

Python SDK

Numeric Scores

When ingesting numeric scores, you can provide the value as a float. If you provide a configId, the score value will be validated against the config’s numeric range, which might be defined by a minimum and/or maximum value.
from abvdev import ABV

abv = ABV(
    api_key="sk-abv-...", # your api key here
    host="https://app.abv.dev", # host="https://eu.app.abv.dev", for EU region
)

# Method 1: Score via low-level method
abv.create_score(
    trace_id="trace_id_here",
    observation_id="observation_id_here", # optional
    session_id="session_id_here", # optional, Id of the session the score relates to
    name="accuracy",
    value=0.9,
    comment="Factually correct", # optional
    score_id="unique_id", # optional, can be used 
    # as an idempotency key to update the score subsequently
    config_id="78545-6565-3453654-43543", # optional, 
    # to ensure that the score follows a specific min/max value range
    data_type="NUMERIC" # optional, possibly inferred
)

# Method 2: Score within context
with abv.start_as_current_span(name="my-operation") as span:
    span.score(
        name="accuracy",
        value=0.9,
        comment="Factually correct",
        config_id="78545-6565-3453654-43543",
        data_type="NUMERIC"
    )
Categorical Scores

Categorical scores are used to evaluate data that falls into specific categories. When ingesting categorical scores, you can provide the value as a string. If you provide a configId, the score value will be validated against the config’s categories.
from abvdev import ABV

abv = ABV(
    api_key="sk-abv-...", # your api key here
    host="https://app.abv.dev", # host="https://eu.app.abv.dev", for EU region
)

# Method 1: Score via low-level method
abv.create_score(
    trace_id="trace_id_here",
    observation_id="observation_id_here", # optional
    name="correctness",
    value="correct",
    comment="Factually correct", # optional
    score_id="unique_id", # optional, can be used as an idempotency 
    # key to update the score subsequently
    config_id="12345-6565-3453654-43543", # optional, to ensure that 
    # the score maps to a specific category defined in a score config
    data_type="CATEGORICAL" # optional, possibly inferred
)

# Method 2: Score within context
with abv.start_as_current_span(name="my-operation") as span:
    span.score(
        name="correctness",
        value="correct",
        comment="Factually correct",
        config_id="12345-6565-3453654-43543",
        data_type="CATEGORICAL"
    )
Boolean Scores

When ingesting boolean scores, you can provide the value as a float (0 or 1). If you provide a configId, the score’s name must match the config’s name, and their data types must match as well.
from abvdev import ABV

abv = ABV(
    api_key="sk-abv-...", # your api key here
    host="https://app.abv.dev", # host="https://eu.app.abv.dev", for EU region
)

# Method 1: Score via low-level method
abv.create_score(
    trace_id="trace_id_here",
    observation_id="observation_id_here", # optional
    name="helpfulness",
    value=1,
    comment="Factually correct", # optional
    score_id="unique_id", # optional, can be used as an 
    # idempotency key to update the score subsequently
    config_id="93547-6565-3453654-43543", # optional, can 
    # be used to infer the score data type and validate the score value
    data_type="BOOLEAN" # optional, possibly inferred
)

# Method 2: Score within context
with abv.start_as_current_span(name="my-operation") as span:
    span.score(
        name="helpfulness",
        value=1,
        comment="Factually correct",
        config_id="93547-6565-3453654-43543",
        data_type="BOOLEAN"
    )

JS/TS SDK

Numeric Scores

When ingesting numeric scores, you can provide the value as a float. If you provide a configId, the score value will be validated against the config’s numeric range, which might be defined by a minimum and/or maximum value.
import { ABVClient } from "@abvdev/client";
import dotenv from "dotenv";
dotenv.config();
 
const abv = new ABVClient();
 
abv.score.create({
  traceId: "target_trace_id_here",
  observationId: "target_observation_id_here", // optional
  name: "accuracy",
  value: 7,
  comment: "Factually correct", // optional
  id: "unique_id", // optional, can be used as an 
  // idempotency key to update the score subsequently
  configId: "config-id-here", // optional, 
  // to ensure that the score follows a specific min/max value range
  dataType: "NUMERIC", // optional, possibly inferred
});
 
async function main() {
    // Flush the scores in short-lived environments
    await abv.flush();
}

main();
Categorical Scores

Categorical scores are used to evaluate data that falls into specific categories. When ingesting categorical scores, you can provide the value as a string. If you provide a configId, the score value will be validated against the config’s categories.
import { ABVClient } from "@abvdev/client";
import dotenv from "dotenv";
dotenv.config();
 
const abv = new ABVClient();
 
abv.score.create({
  id: "unique_id", // optional, can be used 
  // as an idempotency key to update the score subsequently
  traceId: "target_trace_id_here",
  observationId: "target_observation_id_here", // optional
  name: "accuracy",
  value: "partially correct",
  configId: "config-id-here", // optional, to ensure that 
  // a score maps to a specific category defined in a score config
  dataType: "CATEGORICAL", // optional, inferred if not provided
  comment: "Factually correct", // optional
});
 
async function main() {
    // Flush the scores in short-lived environments
    await abv.flush();
}

main();

Boolean Scores

When ingesting boolean scores, you can provide the value as a float (0 or 1). If you provide a configId, the score’s name must match the config’s name, and their data types must match as well.
import { ABVClient } from "@abvdev/client";
import dotenv from "dotenv";
dotenv.config();

const abv = new ABVClient();
 
abv.score.create({
  id: "unique_id", // optional, can be used as an 
  // idempotency key to update the score subsequently
  traceId: "cb35f468686ad95603029f404004d456",
  observationId: "f7145d410802f3fe", // optional
  name: "helpfulness",
  value: 0, // 0 or 1
  configId: "config-id-here", // optional, 
  // can be used to infer the score data type and validate the score value
  dataType: "BOOLEAN", // required, numeric values without
  // data type would be inferred as NUMERIC
  comment: "Incorrect answer", // optional
});
 
async function main() {
  // Flush the scores in short-lived environments
  await abv.flush();
}

main();
→ More details in Python SDK docs and JS/TS SDK docs. See API reference for more details on POST/GET score configs endpoints.

Inferred Score Properties

Certain score properties might be inferred based on your input:
  • If you don’t provide a score data type, it will always be inferred. See the tables below for details.
  • For boolean and categorical scores, the score value is provided in both numeric and string format where possible. The representation that was not given as input, i.e. the translated value, is referred to as the inferred value in the tables below.
  • On read, boolean scores return both the numeric and string representations of the value, e.g. both 1 and True.
  • For categorical scores, the string representation is always provided, and a numeric mapping of the category is produced only if a ScoreConfig was provided (see the sketch after this list).
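For illustration, a minimal Python sketch of the inference behavior described above; the trace id is a hypothetical placeholder, and no data_type is passed in the first two calls, so it is inferred from the value:

from abvdev import ABV

abv = ABV(
    api_key="sk-abv-...", # your api key here
    host="https://app.abv.dev", # host="https://eu.app.abv.dev", for EU region
)

# Float value without data_type -> inferred as NUMERIC
abv.create_score(name="accuracy", value=0.9, trace_id="trace_id_here")

# String value without data_type -> inferred as CATEGORICAL
abv.create_score(name="correctness", value="correct", trace_id="trace_id_here")

# 0/1 values are inferred as NUMERIC unless a boolean data type (or a
# boolean config) is provided, so pass data_type="BOOLEAN" explicitly
abv.create_score(name="helpfulness", value=1, trace_id="trace_id_here", data_type="BOOLEAN")
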
Detailed Examples:

Numeric Scores

For example, let’s assume you’d like to ingest a numeric score to measure accuracy. We have included a table of possible score ingestion scenarios below.
Value | Data Type | Config Id | Description | Inferred Data Type | Valid
0.9 | Null | Null | Data type is inferred | NUMERIC | Yes
0.9 | NUMERIC | Null | No properties inferred | - | Yes
depth | NUMERIC | Null | Error: data type of value does not match provided data type | - | No
0.9 | NUMERIC | 78545 | No properties inferred | - | Conditional on config validation
0.9 | Null | 78545 | Data type is inferred | NUMERIC | Conditional on config validation
depth | NUMERIC | 78545 | Error: data type of value does not match provided data type | - | No

Categorical Scores

For example, let’s assume you’d like to ingest a categorical score to measure correctness. We have included a table of possible score ingestion scenarios below.
Value | Data Type | Config Id | Description | Inferred Data Type | Inferred Value Representation | Valid
correct | Null | Null | Data type is inferred | CATEGORICAL | - | Yes
correct | CATEGORICAL | Null | No properties inferred | - | - | Yes
1 | CATEGORICAL | Null | Error: data type of value does not match provided data type | - | - | No
correct | CATEGORICAL | 12345 | Numeric value inferred | - | 4 (numeric mapping of the config category) | Conditional on config validation
correct | Null | 12345 | Data type is inferred | CATEGORICAL | - | Conditional on config validation
1 | CATEGORICAL | 12345 | Error: data type of value does not match provided data type | - | - | No

Boolean Scores

For example, let’s assume you’d like to ingest a boolean score to measure helpfulness. We have included a table of possible score ingestion scenarios below.
Value | Data Type | Config Id | Description | Inferred Data Type | Inferred Value Representation | Valid
1 | BOOLEAN | Null | Value’s string equivalent is inferred | - | True | Yes
true | BOOLEAN | Null | Error: data type of value does not match provided data type | - | - | No
3 | BOOLEAN | Null | Error: boolean data type expects 0 or 1 as input value | - | - | No
1 | Null | 93547 | Data type and value’s string equivalent are inferred | BOOLEAN | True | Conditional on config validation
depth | BOOLEAN | 93547 | Error: data type of value does not match provided data type | - | - | No

Update Existing Scores via API/SDKs

When creating a score, you can provide an optional id parameter. If a score with that id already exists within your project, it will be updated. To update a score without first fetching the list of existing scores from ABV, set your own id parameter as an idempotency key when initially creating the score.
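
For illustration, a minimal Python sketch; the identifiers are hypothetical placeholders, and because both calls use the same score_id, the second call updates the score created by the first rather than creating a new one:

from abvdev import ABV

abv = ABV(
    api_key="sk-abv-...", # your api key here
    host="https://app.abv.dev", # host="https://eu.app.abv.dev", for EU region
)

# Initial ingestion with a self-chosen id acting as an idempotency key
abv.create_score(
    name="correctness",
    value=0.7,
    trace_id="trace_id_here",
    score_id="trace_id_here-correctness",
)

# Re-ingesting with the same id updates the existing score
abv.create_score(
    name="correctness",
    value=0.9,
    trace_id="trace_id_here",
    score_id="trace_id_here-correctness",
)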