# LLM-as-a-Judge
LLM-as-a-Judge is a technique for evaluating the quality of LLM applications by using an LLM as the judge. The judge LLM is given a trace or a dataset entry and asked to score and reason about the output. The scores and reasoning are stored as Scores (data model docid\ hai73gnnamxtypyenpkef) in ABV.

## Why use LLM-as-a-Judge?

- **Scalable & cost-effective**: judge thousands of outputs quickly and cheaply compared to human panels.
- **Human-like judgments**: captures nuance (helpfulness, safety, coherence) better than simple metrics, especially when rubric-guided.
- **Repeatable comparisons**: with a fixed rubric, you can rerun the same prompts to get consistent scores and short rationales.

## Set up step by step

### 1. Create a new LLM-as-a-Judge evaluator

Navigate to the Evaluators page and click the Create Evaluator button.

### 2. Set the default model

Next, you'll define the default model used for conducting the evaluations. The default is used by every managed evaluator; custom templates may override it. This step requires an LLM connection to be set up; please see LLM Connections (docid\ udkfdo70m djvlxr3fsql) for more information.

- **Setup**: the default model needs to be set up only once, though it can be changed at any point if needed.
- **Changing the model**: existing evaluators keep evaluating with the new model; historic results stay preserved.
- **Structured output support**: it's crucial that the chosen default model supports structured output. This is essential for the system to correctly interpret the evaluation results from the LLM judge.

### 3. Pick an evaluator

Now select an evaluator. There are two main ways:

**Managed evaluator**

- ABV ships a growing catalog of evaluators built and maintained by us and partners like Ragas.
- Each evaluator captures best-practice evaluation prompts for a specific quality dimension, e.g. hallucination, context relevance, toxicity, helpfulness.
- Ready to use, no prompt writing required.
- Continuously expanded by adding OSS/partner-maintained evaluators and more evaluator types in the future (e.g. regex-based).

**Custom evaluator**

When the library doesn't fit your specific needs, add your own (see the sketch after this list):

- Draft an evaluation prompt with `{{variables}}` placeholders (`input`, `output`, `ground_truth`, …).
- Optional: customize the score (0–1) and reasoning prompts to guide the LLM in scoring.
- Optional: pin a dedicated model for this evaluator. If no custom model is specified, it uses the default evaluation model (see step 2).
- Save → the evaluator can now be reused across your project.
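As a rough illustration of the custom-evaluator path, the sketch below shows what a correctness-style evaluation prompt with `{{input}}`, `{{output}}`, and `{{ground_truth}}` placeholders might look like, and how those placeholders get filled before the prompt is sent to the judge model. The template wording, the `fill_template` helper, and the requested JSON shape are illustrative assumptions, not ABV's exact format.

```python
import re

# Hypothetical correctness rubric; ABV's managed prompts will differ in wording.
EVAL_PROMPT = """You are an impartial judge. Rate how correct the response is.

Question:
{{input}}

Response to evaluate:
{{output}}

Reference answer:
{{ground_truth}}

Return a JSON object with:
- "score": a float between 0 and 1 (1 = fully correct)
- "reasoning": a short explanation of the score
"""

def fill_template(template: str, variables: dict) -> str:
    """Replace {{variable}} placeholders with the mapped values (illustrative helper)."""
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: str(variables.get(m.group(1), "")),
        template,
    )

prompt = fill_template(
    EVAL_PROMPT,
    {
        "input": "What is the capital of France?",
        "output": "The capital of France is Paris.",
        "ground_truth": "Paris",
    },
)
print(prompt)  # The filled prompt is what the judge model would receive.
```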
### 4. Choose which data to evaluate

With your evaluator and model selected, you now specify which data to run the evaluations on. You can choose between running on production tracing data or on datasets during dataset runs.

**Live data**

Evaluating live production traffic allows you to monitor the performance of your LLM application in real time.

- **Scope**: choose whether to run on new traces only and/or on existing traces once (for backfilling). When in doubt, we recommend running on new traces.
- **Filter**: narrow down the evaluation to the specific subset of data you're interested in. You can filter by trace name, tags, user ID, and many more; combine filters freely.
- **Preview**: ABV shows a sample of traces from the last 24 hours that match your current filters, allowing you to sanity-check your selection.
- **Sampling**: to manage costs and evaluation throughput, you can configure the evaluator to run on a percentage (e.g., 5%) of the matched traces.

**Dataset runs**

LLM-as-a-Judge evaluators can score the results of your dataset runs.

- Prompt Experiments (docid\ q l3isqw0w dt6oprd85j): when running native dataset runs through the UI, you can simply select which evaluators you want to run. These selected evaluators will then automatically execute on the data generated by your next dataset run.
- Remote Dataset Runs (docid\ d5g6uw9tlan6qhyajepv1): before running remote dataset runs through the SDKs, you need to set up which evaluators you want to run in the UI. Configure a running evaluator in the following format:
  - **Dataset filter**: which source dataset the evaluator should run on.
  - **Scope**: choose whether to target only new dataset runs and/or execute the evaluator on past dataset runs (for backfilling).
  - **Sampling**: to manage costs and evaluation throughput, you can configure the evaluator to run on a percentage (e.g., 5%) of dataset run items.

### 5. Map variables & preview the evaluation prompt

You now need to teach ABV which properties of your trace or dataset item represent the actual data to populate the prompt's variables for a sensible evaluation. For instance, you might map your system's logged trace input to the prompt's `{{input}}` variable, and the LLM response (i.e., the trace output) to the prompt's `{{output}}` variable. This mapping is crucial for ensuring the evaluation is sensible and relevant.

**Live data**

- **Prompt preview**: as you configure the mapping, ABV shows a live preview of the evaluation prompt populated with actual data. This preview uses historical traces from the last 24 hours that matched your filters (from step 4). You can navigate through several example traces to see how their respective data fills the prompt, helping you build confidence that the mapping is correct.
- **JSONPath**: if the data is nested (e.g., within a JSON object), you can use a JSONPath expression (like `$.choices[0].message.content`) to precisely locate it.

**Dataset runs**

- **Suggested mappings**: the system will often be able to autocomplete common mappings based on typical field names in datasets. For example, if you're evaluating for correctness and your prompt includes `{{input}}`, `{{output}}`, and `{{ground_truth}}` variables, we would likely suggest mapping these to the trace input, trace output, and the dataset item's expected output, respectively.
- **Edit mappings**: you can easily edit these suggestions if your dataset schema differs; you can map any properties of your dataset item (e.g., input, expected output). Further, as dataset runs create traces under the hood, using the trace input/output as the evaluation input/output is a common pattern; think of the trace output as your experiment run's output.

✨ Done! You have successfully set up an evaluator which will run on your data.

Need custom logic? Use the SDK instead; see Custom Scores (docid\ ds lvefqrlpefayxhhtmh) or the external pipeline example below.

## Monitor & iterate

As our system evaluates your data, it writes the results as Scores (data model docid\ hai73gnnamxtypyenpkef). You can then:

- **View logs**: check detailed logs for each evaluation, including status, any retry errors, and the full request/response bodies sent to the evaluation model.
- **Use dashboards**: aggregate scores over time, filter by version or environment, and track the performance of your LLM application.
- **Take actions**: pause, resume, or delete an evaluator.
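For the "need custom logic?" path mentioned above, here is a minimal sketch of an external evaluation pipeline: it calls a judge model with structured (JSON) output and then writes the verdict back as a score. The OpenAI client usage is standard; the `ABV_SCORES_ENDPOINT` URL, payload fields, and auth header are placeholders standing in for whatever ABV's custom-scores API or SDK actually expects.

```python
import json
import os

import requests
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Placeholder endpoint: substitute ABV's real custom-scores API or SDK call.
ABV_SCORES_ENDPOINT = "https://abv.example.com/api/scores"

def judge(question: str, answer: str) -> dict:
    """Ask the judge model for a structured score + reasoning."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any model with reliable JSON output works
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "user",
                "content": (
                    "Rate the helpfulness of the answer from 0 to 1 and explain briefly.\n"
                    f"Question: {question}\nAnswer: {answer}\n"
                    'Reply as JSON: {"score": <float>, "reasoning": "<string>"}'
                ),
            }
        ],
    )
    return json.loads(response.choices[0].message.content)

def push_score(trace_id: str, result: dict) -> None:
    """Write the judge's verdict back as a score (hypothetical request shape)."""
    requests.post(
        ABV_SCORES_ENDPOINT,
        headers={"Authorization": f"Bearer {os.environ['ABV_API_KEY']}"},
        json={
            "traceId": trace_id,
            "name": "helpfulness",
            "value": result["score"],
            "comment": result["reasoning"],
        },
        timeout=30,
    )

result = judge("How do I reset my password?", "Click 'Forgot password' on the login page.")
push_score("trace-123", result)  # "trace-123" stands in for a real trace ID
```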