
# Overview

Evaluation is a critical part of developing and deploying LLM applications. Teams typically combine several evaluation methods to score the performance of their AI application, choosing them based on the use case and the stage of development.

# Why use LLM Evaluation?

LLM evaluation is crucial for improving the accuracy and robustness of language models, ultimately enhancing the user experience and trust in your AI application. Here are the key benefits:

* **Quality Assurance**: Detect hallucinations, factual inaccuracies, and inconsistent outputs to ensure your AI app delivers reliable results
* **Performance Monitoring**: Measure response quality, relevance, and user satisfaction across different scenarios and edge cases
* **Continuous Improvement**: Identify areas for enhancement and track improvements over time through structured evaluation metrics
* **User Trust**: Build confidence in your AI application by demonstrating consistent, high-quality outputs through systematic evaluation
* **Risk Mitigation**: Catch potential issues before they reach production users, reducing the likelihood of poor user experiences or reputational damage

# Online & Offline Evaluation

**Offline Evaluation involves**

* Evaluating the application in a controlled setting
* Typically using curated test Datasets instead of live user queries
* Running heavily during development (often as part of CI/CD pipelines) to measure improvements and regressions
* Producing repeatable results with clear accuracy metrics, because the test data includes ground truth
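
As a rough illustration of the points above, the sketch below scores a stand-in application against a small curated dataset and applies a pass/fail gate of the kind a CI/CD pipeline might use. Everything in it (`TEST_DATASET`, `fake_app`, `exact_match_accuracy`, `passes_gate`) is hypothetical and written in plain Python; it does not use the ABV SDK.

```python
from typing import Callable, Dict, List

# Hypothetical curated test dataset: each input is paired with a ground-truth answer.
TEST_DATASET: List[Dict[str, str]] = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3 * 3", "expected": "9"},
]


def fake_app(question: str) -> str:
    """Stand-in for the LLM application under test (an assumption for this sketch)."""
    canned = {"2 + 2": "4", "capital of France": "paris", "3 * 3": "6"}
    return canned.get(question, "")


def exact_match_accuracy(app: Callable[[str], str], dataset: List[Dict[str, str]]) -> float:
    """Fraction of items whose output equals the expected answer,
    ignoring case and surrounding whitespace."""
    if not dataset:
        return 0.0
    hits = sum(
        app(item["input"]).strip().lower() == item["expected"].strip().lower()
        for item in dataset
    )
    return hits / len(dataset)


def passes_gate(accuracy: float, minimum: float) -> bool:
    """Return True when accuracy meets the minimum required to ship."""
    return accuracy >= minimum


accuracy = exact_match_accuracy(fake_app, TEST_DATASET)
print(f"accuracy: {accuracy:.2f}")
print("gate passed:", passes_gate(accuracy, minimum=0.9))
```

Because the dataset and the scoring rule are fixed, rerunning the script after a prompt or model change gives a directly comparable number.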

**Online Evaluation involves**

* Evaluating the application in a live, real-world environment, i.e. during actual usage in production
* Using Evaluation Methods that track success rates, user satisfaction scores, or other metrics on live traffic
* Capturing issues you might not anticipate in a lab setting, which is the main advantage of online evaluation
* Optionally collecting implicit and explicit user feedback, and possibly running shadow tests or A/B tests
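
One way to track a success rate on live traffic is a rolling window over explicit feedback events. The sketch below is a minimal, self-contained example; the `RollingSuccessRate` class and the thumbs-up/thumbs-down signal are assumptions made for illustration, not part of ABV.

```python
from collections import deque
from typing import Deque, Optional


class RollingSuccessRate:
    """Tracks the share of positive feedback over the most recent `window` events."""

    def __init__(self, window: int = 100) -> None:
        # A bounded deque drops the oldest event once the window is full.
        self._events: Deque[bool] = deque(maxlen=window)

    def record(self, positive: bool) -> None:
        """Store one feedback event (True for thumbs up, False for thumbs down)."""
        self._events.append(positive)

    def rate(self) -> Optional[float]:
        """Return the success rate in the window, or None before any feedback arrives."""
        if not self._events:
            return None
        return sum(self._events) / len(self._events)


monitor = RollingSuccessRate(window=3)
for thumbs_up in [True, True, False, False]:
    monitor.record(thumbs_up)

# Only the last three events (True, False, False) are still in the window.
print(monitor.rate())
```

A bounded window keeps the metric responsive to recent traffic, so a regression after a deployment shows up quickly instead of being averaged away by older feedback.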

**In practice, successful evaluation blends online and offline evaluations.** Many teams run the two in a loop, so that evaluation is continuous and each round improves on the last.

```mermaid
graph TB
    subgraph Offline["Offline Evaluation (Development)"]
        Dataset[Test Dataset]
        RunDataset[Run Dataset through LLM]
        Evaluate[Apply Evaluation Methods<br/>LLM-as-Judge, Human, Custom]
        Scores1[Generate Scores]
        Analyze1[Analyze Results]
        Improve[Improve Prompts/Models]
    end

    subgraph Online["Online Evaluation (Production)"]
        ProdTraffic[Live Production Traffic]
        RealTime[Real-time Monitoring]
        UserFeedback[Collect User Feedback]
        Scores2[Generate Scores]
        Analyze2[Monitor Metrics]
        Issues[Identify Issues]
    end

    Dataset --> RunDataset
    RunDataset --> Evaluate
    Evaluate --> Scores1
    Scores1 --> Analyze1
    Analyze1 --> Improve
    Improve -.->|Deploy| ProdTraffic

    ProdTraffic --> RealTime
    RealTime --> UserFeedback
    UserFeedback --> Scores2
    Scores2 --> Analyze2
    Analyze2 --> Issues
    Issues -.->|Create new test cases| Dataset

    classDef datasetClass fill:#4fc3f7,stroke:#0288d1,color:#000
    classDef scoreClass fill:#81c784,stroke:#388e3c,color:#000
    classDef prodClass fill:#ffb74d,stroke:#f57c00,color:#000
    classDef improveClass fill:#ba68c8,stroke:#8e24aa,color:#000
    classDef issueClass fill:#e57373,stroke:#c62828,color:#000

    class Dataset,RunDataset,Evaluate datasetClass
    class Scores1,Scores2,Analyze1,Analyze2 scoreClass
    class ProdTraffic,RealTime,UserFeedback prodClass
    class Improve improveClass
    class Issues issueClass
```

**The evaluation loop:** Offline testing validates changes before deployment. Production monitoring surfaces real-world issues. Insights from production create new test cases for offline evaluation.
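
The last step of the loop, turning production issues into new test cases, could look roughly like the following. The record layout (`input`, `score`), the `promote_failures` helper, and the 0.5 threshold are all assumptions for this sketch; in a real setup the records would come from your scored production traces.

```python
from typing import Dict, List


def promote_failures(
    production_records: List[Dict],
    dataset: List[Dict],
    threshold: float = 0.5,
) -> int:
    """Append inputs from low-scoring production records to the dataset.

    Inputs already in the dataset are skipped. Returns the number of items added.
    """
    known_inputs = {item["input"] for item in dataset}
    added = 0
    for record in production_records:
        if record["score"] < threshold and record["input"] not in known_inputs:
            # The expected output is left empty for a human reviewer to fill in.
            dataset.append({"input": record["input"], "expected": None})
            known_inputs.add(record["input"])
            added += 1
    return added


dataset = [{"input": "reset my password", "expected": "Send the reset link."}]
production_records = [
    {"input": "cancel my order", "score": 0.2},
    {"input": "reset my password", "score": 0.1},  # already covered by the dataset
    {"input": "track my parcel", "score": 0.9},  # scored well, not promoted
]

added = promote_failures(production_records, dataset)
print(added, [item["input"] for item in dataset])
```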

# Core Concepts

| Concept                | Description                                                                                                                 |
| ---------------------- | --------------------------------------------------------------------------------------------------------------------------- |
| **Scores**             | Scores are a flexible data object that can be used to store any evaluation metric and link it to other objects in ABV.      |
| **Evaluation Methods** | Evaluation methods are functions or tools to assign scores to other objects.                                                |
| **Datasets**           | Datasets are a collection of inputs and, optionally, expected outputs that can be used during Dataset runs.                 |
| **Dataset Runs**       | Dataset runs are used to run a dataset through your LLM application and optionally apply evaluation methods to the results. |
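
To make the relationships between these concepts concrete, here is a simplified sketch using Python dataclasses. The field names and value types are assumptions chosen for illustration and are not ABV's actual schema; refer to the data model pages for the real definitions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Union


@dataclass
class Score:
    """One evaluation result, linked to another object by its id (hypothetical fields)."""
    name: str
    value: Union[float, bool, str]
    target_id: str
    comment: Optional[str] = None


@dataclass
class DatasetItem:
    """One test case: an input and, optionally, an expected output."""
    item_id: str
    input: str
    expected_output: Optional[str] = None


@dataclass
class Dataset:
    """A named collection of test cases."""
    name: str
    items: List[DatasetItem] = field(default_factory=list)


# An evaluation method takes an item and the application's output and returns a Score.
EvaluationMethod = Callable[[DatasetItem, str], Score]


def contains_expected(item: DatasetItem, output: str) -> Score:
    """Toy evaluation method: does the output contain the expected text?"""
    if item.expected_output is None:
        return Score("contains_expected", False, item.item_id, comment="no expected output")
    hit = item.expected_output.lower() in output.lower()
    return Score("contains_expected", hit, item.item_id)


dataset = Dataset("demo", [DatasetItem("item-1", "Capital of France?", "Paris")])
score = contains_expected(dataset.items[0], "The capital is Paris.")
print(score)
```

A dataset run then amounts to calling the application on every item in a `Dataset` and, optionally, passing each output to one or more evaluation methods.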

# Evaluation Methods

Evaluation methods are functions or tools that assign evaluation `Score`s to other objects. ABV stores evaluation metrics as Scores, which are flexible enough to represent any evaluation metric.

ABV currently supports automatic scoring through **LLM-as-a-Judge**, manual **Human Annotations**, and fully **Custom Scoring via API/SDKs**. We are adding more evaluation methods regularly, so stay tuned!

[LLM-as-a-Judge](./llm-as-a-judge)

[Human Annotation](./human-annotation)

[Custom Scores](./custom-scores)

Learn more about the [Scores Data Model](./scores-data-model).

# Dataset Runs

**Dataset Runs** run your LLM application over **Datasets** and **optionally apply Evaluation Methods** to the results. This lets you systematically evaluate your application and compare different inputs, prompts, models, or other parameters side by side under controlled conditions.
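
The side-by-side comparison can be pictured with a small plain-Python loop. This is only a sketch of the idea: `run_dataset`, the two `variant_*` functions, and the `exact_match` evaluator are hypothetical stand-ins, and no ABV SDK calls are shown.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical dataset shared by every variant, so conditions stay controlled.
DATASET: List[Dict[str, str]] = [
    {"input": "hello", "expected": "HELLO"},
    {"input": "abv", "expected": "ABV"},
]


def variant_a(text: str) -> str:
    """Stand-in for the application with one prompt or model configuration."""
    return text.upper()


def variant_b(text: str) -> str:
    """Stand-in for the application with a different configuration."""
    return text.capitalize()


def exact_match(output: str, expected: str) -> float:
    """Evaluation method: 1.0 when the output equals the expected value, else 0.0."""
    return 1.0 if output == expected else 0.0


def run_dataset(
    app: Callable[[str], str],
    dataset: List[Dict[str, str]],
    evaluator: Callable[[str, str], float],
) -> Dict:
    """Run every item through `app`, score each output, and return the scores and their mean."""
    scores = [evaluator(app(item["input"]), item["expected"]) for item in dataset]
    return {"scores": scores, "mean": mean(scores) if scores else 0.0}


results = {
    name: run_dataset(app, DATASET, exact_match)
    for name, app in [("variant_a", variant_a), ("variant_b", variant_b)]
}
for name, result in results.items():
    print(name, result["mean"])
```

Keeping the dataset and evaluator fixed while swapping only the application variant is what makes the resulting means comparable.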

In ABV, we differentiate between **Native** and **Remote** Dataset Runs. Native Dataset Runs rely on the Dataset, Prompts, and optionally LLM-as-a-Judge Evaluators all being on the ABV platform. Remote Dataset Runs rely only on Datasets being on the ABV platform; prompts and evaluation methods can be managed off platform and are run via code.

Both types require managing the [Datasets](/developer/evaluations/datasets) on the ABV platform.

[Create a Dataset](/developer/evaluations/datasets)

[Remote Dataset Runs](./remote-dataset-runs)

[Native Dataset Runs](/developer/evaluations/prompt-experiments)

Learn more about the [Dataset Runs Data Model](./dataset-runs-data-model).

# Integration with Other Features

Evaluations work seamlessly with other ABV features to provide comprehensive testing and monitoring:

* **Prompt Management**: Test different prompt versions using [Prompt Experiments](/developer/evaluations/prompt-experiments) to find the best performing prompts
* **Observability**: Evaluation scores appear directly on [traces](/developer/basic-features/observability-tracing) for real-time quality monitoring
* **SDK Support**: Create and manage evaluations programmatically with the [Python SDK](/developer/sdks/python/evaluations) and [JS/TS SDK](/developer/sdks/js-ts/overview)
* **Metrics Dashboard**: Aggregate evaluation scores in the [Metrics](/developer/platform/metrics/overview) section to track quality trends over time
* **Data Export**: Export evaluation results via the [Public API](/developer/platform/api-data-platform/public-api) for further analysis

# Getting Started

1. **Set up tracing**: Start by instrumenting your application with the [Python SDK](/developer/quickstart-python) or the [JS/TS SDK](/developer/quickstart-js-ts)
2. **Create datasets**: Build test datasets with representative inputs and expected outputs
3. **Choose evaluation methods**: Select from LLM-as-a-Judge, Human Annotation, or Custom Scores
4. **Run evaluations**: Execute dataset runs to evaluate your application systematically
5. **Monitor and iterate**: Track scores over time and improve based on insights