
Custom Scores


Custom scores are the most flexible way to implement evaluation workflows using ABV. As with any other evaluation method, the purpose of custom scores is to assign evaluation metrics to `Traces`, `Observations`, `Sessions`, or `DatasetRuns` via the `Score` object (see Scores Data Model). This is achieved by ingesting scores via the ABV SDKs or API.

## Common use cases

- **Collecting user feedback**: collect in-app feedback from your users on application quality or performance. It can be captured in the frontend via our Browser SDK (see the example notebook).
- **Custom evaluation data pipeline**: continuously monitor quality by fetching traces from ABV, running custom evaluations, and ingesting scores back into ABV (see the example notebook).
- **Guardrails and security checks**: check whether the output contains a certain keyword, adheres to a specified structure/format, or is longer than a certain length (see LLM Security & Guardrails).
- **Custom internal workflow tooling**: build custom internal tooling that helps you manage human-in-the-loop workflows. Ingest scores back into ABV, optionally following your custom schema by referencing a config.
- **Custom run-time evaluations**: e.g. track whether the generated SQL code actually worked, or whether the structured output was valid JSON.

## Ingesting scores via API/SDKs

You can add scores via the ABV SDKs or API. Scores can take one of three data types: **Numeric**, **Categorical**, or **Boolean**.

If a score is ingested manually using a `trace_id` to link the score to a trace, it is not necessary to wait until the trace has been created. The score will show up in the scores table and will be linked to the trace once the trace with the same `trace_id` is created.

Here are examples by score data type.

### Python SDK

#### Numeric

Numeric score values must be provided as float.

```python
from abvdev import get_client

abv = get_client()

# Method 1: Score via low-level method
abv.create_score(
    name="correctness",
    value=0.9,
    trace_id="trace_id_here",
    observation_id="observation_id_here",  # optional
    data_type="NUMERIC",  # optional, inferred if not provided
    comment="factually correct",  # optional
)

# Method 2: Score current span/generation (within context)
with abv.start_as_current_span(name="my-operation") as span:
    # Score the current span
    span.score(
        name="correctness",
        value=0.9,
        data_type="NUMERIC",
        comment="factually correct",
    )

    # Score the trace
    span.score_trace(
        name="overall_quality",
        value=0.95,
        data_type="NUMERIC",
    )

# Method 3: Score via the current context
with abv.start_as_current_span(name="my-operation"):
    # Score the current span
    abv.score_current_span(
        name="correctness",
        value=0.9,
        data_type="NUMERIC",
        comment="factually correct",
    )

    # Score the trace
    abv.score_current_trace(
        name="overall_quality",
        value=0.95,
        data_type="NUMERIC",
    )
```

#### Categorical

Categorical score values must be provided as strings.

```python
from abvdev import get_client

abv = get_client()

# Method 1: Score via low-level method
abv.create_score(
    name="accuracy",
    value="partially correct",
    trace_id="trace_id_here",
    observation_id="observation_id_here",  # optional
    data_type="CATEGORICAL",  # optional, inferred if not provided
    comment="some factual errors",  # optional
)

# Method 2: Score current span/generation (within context)
with abv.start_as_current_span(name="my-operation") as span:
    # Score the current span
    span.score(
        name="accuracy",
        value="partially correct",
        data_type="CATEGORICAL",
        comment="some factual errors",
    )

    # Score the trace
    span.score_trace(
        name="overall_quality",
        value="partially correct",
        data_type="CATEGORICAL",
    )

# Method 3: Score via the current context
with abv.start_as_current_span(name="my-operation"):
    # Score the current span
    abv.score_current_span(
        name="accuracy",
        value="partially correct",
        data_type="CATEGORICAL",
        comment="some factual errors",
    )

    # Score the trace
    abv.score_current_trace(
        name="overall_quality",
        value="partially correct",
        data_type="CATEGORICAL",
    )
```

#### Boolean

Boolean scores must be provided as a float. The value's string equivalent will
be automatically populated and is accessible on read. See the API reference for more details on POST/GET scores endpoints.

```python
from abvdev import get_client

abv = get_client()

# Method 1: Score via low-level method
abv.create_score(
    name="helpfulness",
    value=0,  # 0 or 1
    trace_id="trace_id_here",
    observation_id="observation_id_here",  # optional
    data_type="BOOLEAN",  # required, numeric values without data type would be inferred as NUMERIC
    comment="incorrect answer",  # optional
)

# Method 2: Score current span/generation (within context)
with abv.start_as_current_span(name="my-operation") as span:
    # Score the current span
    span.score(
        name="helpfulness",
        value=1,  # 0 or 1
        data_type="BOOLEAN",
        comment="very helpful response",
    )

    # Score the trace
    span.score_trace(
        name="overall_quality",
        value=1,  # 0 or 1
        data_type="BOOLEAN",
    )

# Method 3: Score via the current context
with abv.start_as_current_span(name="my-operation"):
    # Score the current span
    abv.score_current_span(
        name="helpfulness",
        value=1,  # 0 or 1
        data_type="BOOLEAN",
        comment="very helpful response",
    )

    # Score the trace
    abv.score_current_trace(
        name="overall_quality",
        value=1,  # 0 or 1
        data_type="BOOLEAN",
    )
```

### JS/TS SDK

#### Numeric

Numeric score values must be provided as float.

```typescript
import { ABVClient } from "@abvdev/client";

const abv = new ABVClient();

abv.score.create({
  id: "unique_id", // optional, can be used as an idempotency key to update the score subsequently
  traceId: message.traceId,
  observationId: message.generationId, // optional
  name: "correctness",
  value: 0.9,
  dataType: "NUMERIC", // optional, inferred if not provided
  comment: "factually correct", // optional
});

// Flush the scores in short-lived environments
await abv.flush();
```

#### Categorical

Categorical score values must be provided as strings.

```typescript
import { ABVClient } from "@abvdev/client";

const abv = new ABVClient();

abv.score.create({
  id: "unique_id", // optional, can be used as an idempotency key to update the score subsequently
  traceId: message.traceId,
  observationId: message.generationId, // optional
  name: "accuracy",
  value: "partially correct",
  dataType: "CATEGORICAL", // optional, inferred if not provided
  comment: "factually correct", // optional
});

// Flush the scores in short-lived environments
await abv.flush();
```

#### Boolean

Boolean scores must be provided as a float. The value's string equivalent will be automatically populated and is accessible on read. See the API reference for more details on POST/GET scores endpoints.

```typescript
import { ABVClient } from "@abvdev/client";

const abv = new ABVClient();

abv.score.create({
  id: "unique_id", // optional, can be used as an idempotency key to update the score subsequently
  traceId: message.traceId,
  observationId: message.generationId, // optional
  name: "helpfulness",
  value: 0, // 0 or 1
  dataType: "BOOLEAN", // required, numeric values without data type would be inferred as NUMERIC
  comment: "incorrect answer", // optional
});

// Flush the scores in short-lived environments
await abv.flush();
```

→ More details in the Python SDK overview and the [JS/TS SDK docs](https://docs.abv.dev/jsts-sdk#8np3b). See the [API reference](https://docs.abv.dev/public-api) for more details on POST/GET score configs endpoints.

## Preventing duplicate scores

By default, ABV allows multiple scores of the same name on the same trace. This is useful if you'd like to track the evolution of a score over time, or if, for example, you've received multiple user feedback scores on the same trace.

In some cases, you want to prevent this behavior or update an existing score. This can be achieved by creating an idempotency key for the score and passing it as the score's id when creating the score, e.g. `<trace_id>-<score_name>`.

## Enforcing a score config

Score configs are helpful when you want to standardize your scores for future analysis.

To enforce a score config, you can provide a `configId` when creating a score to reference a `ScoreConfig` that was previously created. Score configs can be defined in the ABV UI or via our API. See our guide on how
to create and manage score configs.

Whenever you provide a `ScoreConfig`, the score data will be validated against the config. The following rules apply:

- **Score name**: must equal the config's name.
- **Score data type**: when provided, must match the config's data type.
- **Score value when type is numeric**: value must be within the min and max values defined in the config (min and max are optional; when not provided they are assumed to be -∞ and +∞ respectively).
- **Score value when type is categorical**: value must map to one of the categories defined in the config.
- **Score value when type is boolean**: value must equal `0` or `1`.

### Python SDK

#### Numeric scores

When ingesting numeric scores, you can provide the value as a float. If you provide a `config_id`, the score value will be validated against the config's numeric range, which might be defined by a minimum and/or maximum value.

```python
from abvdev import get_client

abv = get_client()

# Method 1: Score via low-level method
abv.create_score(
    trace_id="trace_id_here",
    observation_id="observation_id_here",  # optional
    session_id="session_id_here",  # optional, id of the session the score relates to
    name="accuracy",
    value=0.9,
    comment="factually correct",  # optional
    score_id="unique_id",  # optional, can be used as an idempotency key to update the score subsequently
    config_id="78545-6565-3453654-43543",  # optional, to ensure that the score follows a specific min/max value range
    data_type="NUMERIC",  # optional, possibly inferred
)

# Method 2: Score within context
with abv.start_as_current_span(name="my-operation") as span:
    span.score(
        name="accuracy",
        value=0.9,
        comment="factually correct",
        config_id="78545-6565-3453654-43543",
        data_type="NUMERIC",
    )
```

#### Categorical scores

Categorical scores are used to evaluate data that falls into specific categories. When ingesting categorical scores, you can provide the value as a string. If you provide a `config_id`, the score value will be validated against the config's categories.

```python
from abvdev import get_client

abv = get_client()

# Method 1: Score via low-level method
abv.create_score(
    trace_id="trace_id_here",
    observation_id="observation_id_here",  # optional
    name="correctness",
    value="correct",
    comment="factually correct",  # optional
    score_id="unique_id",  # optional, can be used as an idempotency key to update the score subsequently
    config_id="12345-6565-3453654-43543",  # optional, to ensure that the score maps to a specific category defined in a score config
    data_type="CATEGORICAL",  # optional, possibly inferred
)

# Method 2: Score within context
with abv.start_as_current_span(name="my-operation") as span:
    span.score(
        name="correctness",
        value="correct",
        comment="factually correct",
        config_id="12345-6565-3453654-43543",
        data_type="CATEGORICAL",
    )
```

#### Boolean scores

When ingesting boolean scores, you can provide the value as a float. If you provide a `config_id`, the score's name and the config's name must match, as must their data types.

```python
from abvdev import get_client

abv = get_client()

# Method 1: Score via low-level method
abv.create_score(
    trace_id="trace_id_here",
    observation_id="observation_id_here",  # optional
    name="helpfulness",
    value=1,
    comment="factually correct",  # optional
    score_id="unique_id",  # optional, can be used as an idempotency key to update the score subsequently
    config_id="93547-6565-3453654-43543",  # optional, can be used to infer the score data type and validate the score value
    data_type="BOOLEAN",  # optional, possibly inferred
)

# Method 2: Score within context
with abv.start_as_current_span(name="my-operation") as span:
    span.score(
        name="helpfulness",
        value=1,
        comment="factually correct",
        config_id="93547-6565-3453654-43543",
        data_type="BOOLEAN",
    )
```

### JS/TS SDK

#### Numeric scores

When ingesting numeric scores, you can provide the value as a float. If you provide a `configId`, the score value will be validated against the config's numeric range, which might be defined by a minimum and/or maximum value.

```typescript
import { ABVClient } from "@abvdev/client";

const abv = new ABVClient();

abv.score.create({
  traceId: message.traceId,
  observationId: message.generationId, // optional
  name: "accuracy",
  value: 0.9,
  comment: "factually correct", // optional
  id: "unique_id", // optional, can be used as an idempotency key to update the score subsequently
  configId: "78545-6565-3453654-43543", // optional, to ensure that the score follows a specific min/max value range
  dataType: "NUMERIC", // optional, possibly inferred
});

// Flush the scores in short-lived environments
await abv.flush();
```

#### Categorical scores

Categorical scores are used to evaluate data that falls into specific categories. When ingesting categorical scores, you can provide the value as a string. If you provide a `configId`, the score value will be validated against the config's categories.

```typescript
import { ABVClient } from "@abvdev/client";

const abv = new ABVClient();

abv.score.create({
  traceId: message.traceId,
  observationId: message.generationId, // optional
  name: "correctness",
  value: "correct",
  comment: "factually correct", // optional
  id: "unique_id", // optional, can be used as an idempotency key to update the score subsequently
  configId: "12345-6565-3453654-43543", // optional, to ensure that a score maps to a specific category defined in a score config
  dataType: "CATEGORICAL", // optional, possibly inferred
});

// Flush the scores in short-lived environments
await abv.flush();
```

#### Boolean scores

When ingesting boolean scores, you can provide the value as a float. If you provide a `configId`, the score's name and the config's name must match, as must their data types.

```typescript
import { ABVClient } from "@abvdev/client";

const abv = new ABVClient();

abv.score.create({
  traceId: message.traceId,
  observationId: message.generationId, // optional
  name: "helpfulness",
  value: 1,
  comment: "factually correct", // optional
  id: "unique_id", // optional, can be used as an idempotency key to update the score subsequently
  configId: "93547-6565-3453654-43543", // optional, can be used to infer the score data type and validate the score value
  dataType: "BOOLEAN", // optional, possibly inferred
});

// Flush the scores in short-lived environments
await abv.flush();
```

→ More details in the [Python SDK docs](https://docs.abv.dev/python-sdk#tufco) and the [JS/TS SDK docs](https://docs.abv.dev/jsts-sdk#8np3b). See the API reference for more details on POST/GET score configs endpoints.

## Inferred score properties

Certain score properties might be inferred based on your input. If you don't provide a score data type, it will always be inferred. See the tables below for details.

For boolean and categorical scores, we will provide the score value in both numerical and string format where possible. The score value format that is not provided as input, i.e. the translated value, is referred to as the inferred value in the tables below.

On read, boolean scores return both the numerical and the string representation of the score value, e.g. both `1` and `True`. For categorical scores, the string representation is always provided, and a numerical mapping of the category is produced only if a `ScoreConfig` was provided.

### Detailed examples

#### Numeric scores

For example, let's assume you'd like to ingest a numeric score to measure accuracy. The table below lists possible score ingestion scenarios.

| Value | Data Type | Config Id | Description | Inferred Data Type | Valid |
| --- | --- | --- | --- | --- | --- |
| `0.9` | `null` | `null` | Data type is inferred | `NUMERIC` | Yes |
| `0.9` | `NUMERIC` | `null` | No properties inferred | | Yes |
| `depth` | `NUMERIC` | `null` | Error: data type of value does not match provided data type | | No |
| `0.9` | `NUMERIC` | `78545` | No properties inferred | | Conditional on config validation |
| `0.9` | `null` | `78545` | Data type inferred | `NUMERIC` | Conditional on config validation |
| `depth` | `NUMERIC` | `78545` | Error: data type of value does not match provided data type | | No |

#### Categorical scores

For example, let's assume you'd like to ingest a categorical score to measure correctness. The table below lists possible score ingestion scenarios.

| Value | Data Type | Config Id | Description | Inferred Data Type | Inferred Value Representation | Valid |
| --- | --- | --- | --- | --- | --- | --- |
| `correct` | `null` | `null` | Data type is inferred | `CATEGORICAL` | | Yes |
| `correct` | `CATEGORICAL` | `null` | No properties inferred | | | Yes |
| `1` | `CATEGORICAL` | `null` | Error: data type of value does not match provided data type | | | No |
| `correct` | `CATEGORICAL` | `12345` | Numeric value inferred | | `4` (numeric config category mapping) | Conditional on config validation |
| `correct` | `null` | `12345` | Data type inferred | `CATEGORICAL` | | Conditional on config validation |
| `1` | `CATEGORICAL` | `12345` | Error: data type of value does not match provided data type | | | No |

#### Boolean scores

For example, let's assume you'd like to ingest a boolean score to measure helpfulness. The table below lists possible score ingestion scenarios.

| Value | Data Type | Config Id | Description | Inferred Data Type | Inferred Value Representation | Valid |
| --- | --- | --- | --- | --- | --- | --- |
| `1` | `BOOLEAN` | `null` | Value's string equivalent inferred | | `true` | Yes |
| `true` | `BOOLEAN` | `null` | Error: data type of value does not match provided data type | | | No |
| `3` | `BOOLEAN` | `null` | Error: boolean data type expects `0` or `1` as input value | | | No |
| `0.9` | `null` | `93547` | Data type and value's string equivalent inferred | `BOOLEAN` | `true` | Conditional on config validation |
| `depth` | `BOOLEAN` | `93547` | Error: data type of value does not match provided data type | | | No |

## Update existing scores via API/SDKs

When creating a score, you can provide an optional `id` parameter. This will update the score if it already exists within your project.

If you want to update a score without needing to fetch the list of existing scores from ABV, you can set your own `id` parameter as an idempotency key when initially creating the score.
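
The idempotency-key pattern described above can be illustrated with a small, self-contained sketch. It does not call ABV: `InMemoryScoreStore` and `make_score_id` are hypothetical names introduced only for illustration, and the `-` separator in the key is an assumption. The store simply models "one score per id, writing the same id again updates it".

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Union

ScoreValue = Union[float, str]


def make_score_id(trace_id: str, score_name: str) -> str:
    """Build a deterministic idempotency key (hypothetical helper)."""
    return f"{trace_id}-{score_name}"


@dataclass
class InMemoryScoreStore:
    """Hypothetical stand-in for the backend: at most one score per id."""

    scores: Dict[str, dict] = field(default_factory=dict)

    def create_score(
        self,
        *,
        score_id: str,
        trace_id: str,
        name: str,
        value: ScoreValue,
        comment: Optional[str] = None,
    ) -> dict:
        # Writing with an id that already exists replaces the stored score
        # instead of adding a second one.
        self.scores[score_id] = {
            "trace_id": trace_id,
            "name": name,
            "value": value,
            "comment": comment,
        }
        return self.scores[score_id]


store = InMemoryScoreStore()
score_id = make_score_id("trace-123", "helpfulness")

# Initial score for this trace.
store.create_score(
    score_id=score_id, trace_id="trace-123", name="helpfulness", value=0
)

# Same id again: the existing score is updated, no duplicate is created.
store.create_score(
    score_id=score_id,
    trace_id="trace-123",
    name="helpfulness",
    value=1,
    comment="revised after review",
)

print(len(store.scores), store.scores[score_id]["value"])
```

Because the key is derived from the trace id and score name, the caller never needs to look up existing scores before writing. With the SDKs, the same key would be passed as the score's id parameter in the create call, as in the examples in this document.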