# Model Usage & Cost Tracking
abv tracks the usage and costs of your LLM generations and provides breakdowns by usage type:

- **Usage details**: number of units consumed per usage type.
- **Cost details**: USD cost per usage type.

Usage types can be arbitrary strings and differ by LLM provider. At the highest level, they can be simply `input` and `output`. As LLMs grow more sophisticated, additional usage types become necessary, such as `cached_tokens`, `audio_tokens`, or `image_tokens`.

In the UI, abv summarizes all usage types that include the string `input` as input usage types, and all usage types that include the string `output` as output usage types. If no `total` usage type is ingested, abv sums up all usage type units to a total.

Both usage details and cost details can either be [ingested](#ingest-usage-andor-cost) via API, SDKs, or integrations, or [inferred](#infer-usage-andor-cost) based on the `model` parameter of the generation. abv comes with a list of predefined popular models and their tokenizers, including OpenAI, Anthropic, and Google models. You can also add your own [custom model definitions](#custom-model-definitions) or request official support for new models. Inferred cost is calculated at the time of ingestion with the model and price information available at that point in time. Ingested usage and cost are prioritized over inferred usage and cost.

Via the Metrics API, you can retrieve aggregated daily usage and cost metrics from abv for downstream use in analytics, billing, and rate limiting. The API allows you to filter by application type, user, or tags.
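To make these aggregation rules concrete, here is a purely illustrative Python sketch of how the input/output groupings and the derived total could be computed from a `usage_details` dictionary. It does not use the abv SDK and is not abv's internal implementation.

```python
# Purely illustrative: mimics the aggregation rules described above; not the abv SDK.
usage_details = {
    "input": 10,
    "cache_read_input_tokens": 2,
    "output": 5,
}

# Usage types containing "input" are summarized as input usage in the UI,
# usage types containing "output" as output usage.
input_usage = sum(units for name, units in usage_details.items() if "input" in name)
output_usage = sum(units for name, units in usage_details.items() if "output" in name)

# If no "total" usage type is ingested, all usage type units are summed up.
total = usage_details.get("total", sum(usage_details.values()))

print(input_usage, output_usage, total)  # 12 5 17
```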
## Ingest usage and/or cost

If available in the LLM response, ingesting usage and/or cost is the most accurate and robust way to track usage in abv. Many of the abv integrations automatically capture usage details and cost details data from the LLM response. If this does not work as expected, please write to support.

**Python SDK (with decorator)**

```python
from abvdev import observe, get_client
import anthropic

abv = get_client()
anthropic_client = anthropic.Anthropic()

@observe(as_type="generation")
def anthropic_completion(**kwargs):
    # Optional: extract some fields from kwargs
    kwargs_clone = kwargs.copy()
    input = kwargs_clone.pop("messages", None)
    model = kwargs_clone.pop("model", None)
    abv.update_current_generation(
        input=input,
        model=model,
        metadata=kwargs_clone,
    )

    response = anthropic_client.messages.create(**kwargs)

    abv.update_current_generation(
        usage_details={
            "input": response.usage.input_tokens,
            "output": response.usage.output_tokens,
            "cache_read_input_tokens": response.usage.cache_read_input_tokens,
            # "total": int,  # if not set, it is derived from input + cache_read_input_tokens + output
        },
        # Optionally, also ingest USD cost. Alternatively, you can infer it via a model definition in abv.
        cost_details={
            # Here we assume the input and output cost are 1 USD each and half the price for cached tokens.
            "input": 1,
            "cache_read_input_tokens": 0.5,
            "output": 1,
            # "total": float,  # if not set, it is derived from input + cache_read_input_tokens + output
        },
    )

    # Return result
    return response.content[0].text

@observe()
def main():
    return anthropic_completion(
        model="gpt-5-2025-08-07",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": "Hello, Claude"}
        ],
    )

main()
```

**Python SDK (manual)**

```python
from abvdev import get_client
import anthropic

abv = get_client()
anthropic_client = anthropic.Anthropic()

with abv.start_as_current_generation(
    name="anthropic-completion",
    model="gpt-5-2025-08-07",
    input=[{"role": "user", "content": "Hello, Claude"}],
) as generation:
    response = anthropic_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Hello, Claude"}],
    )

    generation.update(
        output=response.content[0].text,
        usage_details={
            "input": response.usage.input_tokens,
            "output": response.usage.output_tokens,
            "cache_read_input_tokens": response.usage.cache_read_input_tokens,
            # "total": int,  # if not set, it is derived from input + cache_read_input_tokens + output
        },
        # Optionally, also ingest USD cost. Alternatively, you can infer it via a model definition in abv.
        cost_details={
            # Here we assume the input and output cost are 1 USD each and half the price for cached tokens.
            "input": 1,
            "cache_read_input_tokens": 0.5,
            "output": 1,
            # "total": float,  # if not set, it is derived from input + cache_read_input_tokens + output
        },
    )
```

**JS/TS SDK (context manager)**

```typescript
import {
  startActiveObservation,
  startObservation,
  updateActiveTrace,
  updateActiveObservation,
} from "@abvdev/tracing";

await startActiveObservation("context-manager", async (span) => {
  span.update({
    input: { query: "What is the capital of France?" },
  });

  // This generation will automatically be a child of "user-request"
  const generation = startObservation(
    "llm-call",
    {
      model: "gpt-5-2025-08-07",
      input: [{ role: "user", content: "What is the capital of France?" }],
    },
    { asType: "generation" }
  );
  // ... LLM call logic ...
  generation.update({
    usageDetails: {
      input: 10,
      output: 5,
      cache_read_input_tokens: 2,
      some_other_token_count: 10,
      total: 17, // optional, it is derived from input + cache_read_input_tokens + output
    },
    costDetails: {
      // If you don't want the costs to be calculated based on model definitions, you can pass the costDetails manually
      input: 1,
      output: 1,
      cache_read_input_tokens: 0.5,
      some_other_token_count: 1,
      total: 3.5,
    },
    output: { content: "The capital of France is Paris." },
  });
  generation.end();
});
```

**JS/TS SDK (observe wrapper)**

```typescript
import { observe, updateActiveObservation } from "@abvdev/tracing";

// An existing function
async function fetchData(source: string) {
  updateActiveObservation(
    {
      usageDetails: {
        input: 10,
        output: 5,
        cache_read_input_tokens: 2,
        some_other_token_count: 10,
        total: 17, // optional, it is derived from input + cache_read_input_tokens + output
      },
      costDetails: {
        // If you don't want the costs to be calculated based on model definitions, you can pass the costDetails manually
        input: 1,
        output: 1,
        cache_read_input_tokens: 0.5,
        some_other_token_count: 1,
        total: 3.5,
      },
    },
    { asType: "generation" }
  );
  // ... logic to fetch data ...
  return { data: `some data from ${source}` };
}

// Wrap the function to trace it
const tracedFetchData = observe(fetchData, {
  name: "observe-wrapper",
  asType: "generation",
});

const result = await tracedFetchData("api");
```

**JS/TS SDK (manual spans)**

```typescript
const span = startObservation("manual-observation", {
  input: { query: "What is the capital of France?" },
});

const generation = span.startObservation(
  "llm-call",
  {
    model: "gpt-5-2025-08-07",
    input: [{ role: "user", content: "What is the capital of France?" }],
    output: { content: "The capital of France is Paris." },
  },
  { asType: "generation" }
);

generation.update({
  usageDetails: {
    input: 10,
    output: 5,
    cache_read_input_tokens: 2,
    some_other_token_count: 10,
    total: 17, // optional, it is derived from input + cache_read_input_tokens + output
  },
  costDetails: {
    // If you don't want the costs to be calculated based on model definitions, you can pass the costDetails manually
    input: 1,
    output: 1,
    cache_read_input_tokens: 0.5,
    some_other_token_count: 1,
    total: 3.5,
  },
});

generation
  .update({
    output: { content: "The capital of France is Paris." },
  })
  .end();

span.update({ output: "Successfully answered user request." }).end();
```

You can also update the usage and cost via `generation.update()`.

### Compatibility with OpenAI

For increased compatibility with OpenAI, you can also use the OpenAI usage schema: `prompt_tokens` will be mapped to `input`, `completion_tokens` will be mapped to `output`, and `total_tokens` will be mapped to `total`. The keys nested in `prompt_tokens_details` will be flattened with an `input_` prefix, and the keys nested in `completion_tokens_details` will be flattened with an `output_` prefix.

**Python SDK**

```python
from abvdev import get_client

abv = get_client()

with abv.start_as_current_generation(
    name="openai-style-generation",
    model="gpt-5-2025-08-07",
) as generation:
    # Simulate an LLM call
    # response = openai_client.chat.completions.create(...)

    generation.update(
        usage_details={
            # usage (OpenAI-style schema)
            "prompt_tokens": 10,
            "completion_tokens": 25,
            "total_tokens": 35,
            "prompt_tokens_details": {
                "cached_tokens": 5,
                "audio_tokens": 2,
            },
            "completion_tokens_details": {
                "reasoning_tokens": 15,
            },
        }
    )
```

**JS/TS SDK**

```typescript
const generation = startObservation(
  "name",
  {
    // ...
    usageDetails: {
      // usage
      prompt_tokens: integer,
      completion_tokens: integer,
      total_tokens: integer,
      prompt_tokens_details: {
        cached_tokens: integer,
        audio_tokens: integer,
      },
      completion_tokens_details: {
        reasoning_tokens: integer,
      },
    },
    // ...
  },
  { asType: "generation" }
);
```

**Other**

You can also ingest OpenAI-style usage via `generation.update()` and `generation.end()`.
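As a rough sketch of the mapping described above, the following standalone Python snippet flattens an OpenAI-style usage object into the generic usage keys. It is an illustrative approximation, not abv's internal implementation.

```python
# Illustrative approximation of the OpenAI-schema mapping described above;
# not abv's internal implementation.
openai_usage = {
    "prompt_tokens": 10,
    "completion_tokens": 25,
    "total_tokens": 35,
    "prompt_tokens_details": {"cached_tokens": 5, "audio_tokens": 2},
    "completion_tokens_details": {"reasoning_tokens": 15},
}

def map_openai_usage(usage: dict) -> dict:
    mapped = {
        "input": usage["prompt_tokens"],        # prompt_tokens     -> input
        "output": usage["completion_tokens"],   # completion_tokens -> output
        "total": usage["total_tokens"],         # total_tokens      -> total
    }
    # Nested detail keys are flattened with an "input_" / "output_" prefix.
    for key, value in usage.get("prompt_tokens_details", {}).items():
        mapped[f"input_{key}"] = value
    for key, value in usage.get("completion_tokens_details", {}).items():
        mapped[f"output_{key}"] = value
    return mapped

print(map_openai_usage(openai_usage))
# {'input': 10, 'output': 25, 'total': 35,
#  'input_cached_tokens': 5, 'input_audio_tokens': 2, 'output_reasoning_tokens': 15}
```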
## Infer usage and/or cost

If either usage or cost are not ingested, abv will attempt to infer the missing values based on the `model` parameter of the generation at the time of ingestion. This is especially useful for model providers which do not include usage or cost in the response.

abv comes with a list of predefined popular models and their tokenizers, including OpenAI, Anthropic, and Google. Check out the full list (you need to sign in). You can also add your own [custom model definitions](#custom-model-definitions) or request official support for new models.

### Usage

If a tokenizer is specified for the model, abv automatically calculates token amounts for ingested generations.

The following tokenizers are currently supported:

| Model | Tokenizer | Used package | Comment |
| --- | --- | --- | --- |
| `gpt-4o` | `o200k_base` | `tiktoken` | |
| `gpt` | `cl100k_base` | `tiktoken` | |
| `claude` | `claude` | `@anthropic-ai/tokenizer` | According to Anthropic, their tokenizer is not accurate for Claude 3 models. If possible, send us the tokens from their API response. |

### Cost

Model definitions include prices per usage type. Usage types must match exactly with the keys in the usage details object of the generation.

abv automatically calculates cost for ingested generations at the time of ingestion if (1) usage is ingested or inferred, and (2) a matching model definition includes prices.
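To illustrate how per-usage-type prices combine with usage details, here is a small standalone Python sketch. The price values and the assumption that prices are defined per unit are made up for the example and are not taken from an actual abv model definition.

```python
# Illustrative sketch of cost calculation from per-usage-type prices; the prices
# below are made-up per-unit example values, not an actual abv model definition.
model_prices = {  # USD per unit; keys must match the usage_details keys exactly
    "input": 0.000003,
    "cache_read_input_tokens": 0.0000015,
    "output": 0.000015,
}

usage_details = {"input": 1000, "cache_read_input_tokens": 200, "output": 500}

# Cost is only derived for usage types that have a matching price in the model definition.
cost_details = {
    usage_type: units * model_prices[usage_type]
    for usage_type, units in usage_details.items()
    if usage_type in model_prices
}
cost_details["total"] = sum(cost_details.values())

print(cost_details)
# input ≈ 0.003, cache_read_input_tokens ≈ 0.0003, output ≈ 0.0075, total ≈ 0.0108 USD
```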
## Custom model definitions

You can flexibly add your own model definitions to abv via the UI. Model definitions can also be managed programmatically via the Models public API:

```
GET    /api/public/models
POST   /api/public/models
GET    /api/public/models/{id}
DELETE /api/public/models/{id}
```

Models are matched to generations as follows:

| Generation attribute | Model attribute | Notes |
| --- | --- | --- |
| `model` | `match_pattern` | Uses regular expressions, e.g. `(?i)^(gpt-4-0125-preview)$` matches `gpt-4-0125-preview`. |

User-defined models take priority over models maintained by abv.

**Further details**

When using the OpenAI tokenizer, you need to specify the following tokenization config. You can also copy the config from the list of predefined OpenAI models. See the OpenAI documentation for further details. `tokensPerName` and `tokensPerMessage` are required for chat models.

```jsonc
{
  "tokenizerModel": "gpt-5-2025-08-07", // tiktoken model name
  "tokensPerName": 1, // OpenAI ChatMessage tokenization config
  "tokensPerMessage": 4 // OpenAI ChatMessage tokenization config
}
```

## Cost inference for reasoning models

Cost inference by tokenizing the LLM input and output is not supported for reasoning models such as the OpenAI o1 model family. That is, if no token counts are ingested, abv cannot infer cost for reasoning models.

Reasoning models take multiple steps to arrive at a response. Each step generates reasoning tokens that are billed as output tokens, so the effective output token count is the sum of all reasoning tokens and the token count of the final completion. Since abv does not have visibility into the reasoning tokens, it cannot infer the correct cost for generations that have no token usage provided. To benefit from abv cost tracking, please provide the token usage when ingesting o1 model generations.

For more details, see the [OpenAI guide](https://platform.openai.com/docs/guides/reasoning) on how reasoning models work.

## Troubleshooting

**Usage and cost are missing for historical generations.** Except for changes in prices, abv does not retroactively infer usage and cost for existing generations when model definitions are changed. You can request a batch job to apply new model definitions to existing generations.