Evaluation Overview

Evaluation is a critical aspect of developing and deploying LLM applications. Teams usually combine several evaluation methods to score the performance of their AI application, depending on the use case and the stage of the development process.

Why use LLM Evaluation?

LLM evaluation is crucial for improving the accuracy and robustness of language models, ultimately enhancing the user experience and trust in your AI application. Here are the key benefits:

- Quality Assurance: Detect hallucinations, factual inaccuracies, and inconsistent outputs to ensure your AI app delivers reliable results.
- Performance Monitoring: Measure response quality, relevance, and user satisfaction across different scenarios and edge cases.
- Continuous Improvement: Identify areas for enhancement and track improvements over time through structured evaluation metrics.
- User Trust: Build confidence in your AI application by demonstrating consistent, high-quality outputs through systematic evaluation.
- Risk Mitigation: Catch potential issues before they reach production users, reducing the likelihood of poor user experiences or reputational damage.

Online & Offline Evaluation

Offline evaluation involves evaluating the application in a controlled setting:

- Typically uses curated test datasets instead of live user queries.
- Heavily used during development (can be part of CI/CD pipelines) to measure improvements and regressions.
- Repeatable, and you can get clear accuracy metrics since you have ground truth.

Online evaluation involves evaluating the application in a live, real-world environment, i.e. during actual usage in production:

- Uses evaluation methods that track success rates, user satisfaction scores, or other metrics on live traffic.
- Its advantage is that it captures things you might not anticipate in a lab setting.
- Can include collecting implicit and explicit user feedback, and possibly running shadow tests or A/B tests.

In practice, successful evaluation blends online and offline evaluations. Many teams adopt a loop-like approach. This way, evaluation is continuous and ever improving.

Core Concepts

| Concept | Description |
| --- | --- |
| Scores | Scores are a flexible data object that can be used to store any evaluation metric and link it to other objects in ABV. |
| Evaluation Methods | Evaluation methods are functions or tools to assign scores to other objects. |
| Datasets | Datasets are a collection of inputs and, optionally, expected outputs that can be used during dataset runs. |
| Dataset Runs | Dataset runs are used to run a dataset through your LLM application and optionally apply evaluation methods to the results. |

Evaluation Methods

Evaluation methods are functions or tools to assign evaluation scores to other objects. ABV uses scores to store evaluation metrics; the score object is meant to be flexible enough to represent any evaluation metric.

ABV currently supports automatic scoring through LLM-as-a-Judge, manual human annotations, or fully custom scoring via API/SDKs. We keep adding more evaluation methods, so stay tuned!
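To make the relationship between evaluation methods and scores concrete, here is a minimal sketch in plain Python. It does not use the ABV SDK: the `Score` class, its fields, and the two evaluator functions are hypothetical names chosen for illustration, and only mirror the idea that an evaluation method is a function that assigns a score to another object.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Score:
    """Hypothetical score record (not the ABV data model)."""
    name: str                      # metric name, e.g. "exact_match"
    value: Union[float, str]       # numeric or categorical result
    target_id: str                 # id of the object being scored (assumed field)
    comment: Optional[str] = None  # optional free-text explanation


def exact_match(target_id: str, output: str, expected: str) -> Score:
    """Numeric evaluation method: 1.0 if the normalized strings match, else 0.0."""
    matched = output.strip().lower() == expected.strip().lower()
    return Score(name="exact_match", value=1.0 if matched else 0.0, target_id=target_id)


def length_label(target_id: str, output: str, max_words: int = 50) -> Score:
    """Categorical evaluation method: label the output "short" or "long"."""
    word_count = len(output.split())
    label = "short" if word_count <= max_words else "long"
    return Score(
        name="length_label",
        value=label,
        target_id=target_id,
        comment=f"{word_count} words",
    )
```

Because both functions return the same `Score` shape, numeric and categorical metrics can be stored and compared in one place, which is the flexibility the score concept is meant to provide.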
Related pages:

- LLM-as-a-Judge
- Human Annotation
- Custom Scores
- Learn more about the Scores data model

Dataset Runs

Dataset runs are used to loop your LLM application through datasets and optionally apply evaluation methods to the results. This lets you strategically evaluate your application and compare the performance of different inputs, prompts, models, or other parameters side by side under controlled conditions.

In ABV, we differentiate between native and remote dataset runs:

- Native dataset runs rely on the dataset, prompts, and optionally LLM-as-a-Judge evaluators all being on the ABV platform.
- Remote dataset runs rely only on datasets being on the ABV platform; prompts and evaluation methods can be managed off-platform. They are run via code.

All dataset runs require managing the Datasets on the ABV platform.

Related pages:

- Datasets
- Remote Dataset Runs
- Prompt Experiments
- Learn more about the Dataset Runs data model
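The dataset run idea can be sketched in a few lines of plain Python: loop an application over dataset items, apply evaluation methods to each output, and aggregate the scores so that two variants can be compared side by side. This is an illustrative sketch only. It does not call the ABV platform or SDK, and every name in it (`run_dataset`, `summarize`, `DATASET`, the toy `variant_a` and `variant_b` applications) is an assumption made for the example.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional

# A dataset item holds an input and, optionally, an expected output.
DatasetItem = Dict[str, Optional[str]]
# An evaluator maps (output, expected_output) to a numeric score.
Evaluator = Callable[[str, Optional[str]], float]


def exact_match(output: str, expected: Optional[str]) -> float:
    """Return 1.0 when the normalized output equals the expected output."""
    if expected is None:
        return 0.0
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_dataset(
    dataset: List[DatasetItem],
    app: Callable[[str], str],
    evaluators: Dict[str, Evaluator],
) -> List[dict]:
    """Run every dataset item through `app` and score each output."""
    results = []
    for item in dataset:
        output = app(item["input"])
        scores = {
            name: evaluate(output, item.get("expected_output"))
            for name, evaluate in evaluators.items()
        }
        results.append({"input": item["input"], "output": output, "scores": scores})
    return results


def summarize(results: List[dict]) -> Dict[str, float]:
    """Average each score across the run."""
    if not results:
        return {}
    names = list(results[0]["scores"].keys())
    return {name: mean([r["scores"][name] for r in results]) for name in names}


# Toy dataset and two stand-in "applications" (hypothetical, no LLM involved).
DATASET: List[DatasetItem] = [
    {"input": "2+2", "expected_output": "4"},
    {"input": "capital of France", "expected_output": "Paris"},
]


def variant_a(question: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(question, "")


def variant_b(question: str) -> str:
    return {"2+2": "4"}.get(question, "unknown")


if __name__ == "__main__":
    for name, app in [("variant_a", variant_a), ("variant_b", variant_b)]:
        summary = summarize(run_dataset(DATASET, app, {"exact_match": exact_match}))
        print(name, summary)
```

Keeping the dataset and evaluators fixed while swapping only the application variant is what makes the comparison controlled: any difference in the averaged scores comes from the variant, not from the inputs.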