Core Concepts at a Glance
ABV’s observability model has four main building blocks:
| Concept | What It Is | When to Use | Example |
|---|---|---|---|
| Trace | Single request or operation | Every API call or workflow | User asks a question in your chatbot |
| Observation | Individual steps within a trace | Track specific operations | LLM call, database query, function execution |
| Session | Group of related traces | Multi-turn interactions | Entire conversation thread |
| Score | Evaluation metric | Measure quality or performance | Accuracy score, cost, latency |
Data Model Visualization
The following diagram shows how ABV’s core concepts relate to each other:
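Session
└─ Trace (single request or operation)
   ├─ Observation: Span
   │  └─ Observation: Generation (observations can nest)
   └─ Observation: Event

Score → attached to a trace, an observation, or a session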
Key relationships:
- Sessions group multiple traces (one-to-many)
- Traces contain multiple observations (one-to-many)
- Observations can be nested hierarchically
- Scores evaluate traces, observations, or sessions
Traces and Observations
Traces
A trace typically represents a single request or operation; usually, it corresponds to a single API call to your application. It contains the overall input and output of the function, as well as metadata about the request (e.g., user, session, tags).
Observations
Each trace can contain multiple observations that log the individual steps of the execution.
Types
- Events are the basic building blocks; they are used to track discrete events in a trace.
- Spans represent durations of units of work in a trace.
- Generations are spans used to log AI model generations, including the prompt, token usage, and costs.
Nesting
Observations can be nested to represent the hierarchical structure of your application. For example, a trace might contain a span for the entire request, which in turn contains a generation for an LLM call.
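As a rough sketch, nested instrumentation could look like the following (this assumes a hypothetical ABV Python client with context-manager helpers; `abv.trace`, `trace.span`, and `span.generation` are illustrative names, not the published API):

```python
import abv  # hypothetical ABV Python client; all names below are illustrative

def call_model(prompt: str) -> str:
    """Stand-in for your actual LLM client call."""
    return "It's sunny today."

# One trace for the whole request, with a span for the handler
# and a generation nested inside it for the LLM call.
with abv.trace(name="chat-request", input="What's the weather?") as trace:
    with trace.span(name="handle-message") as span:
        with span.generation(name="answer", model="gpt-4",
                             input="What's the weather?") as gen:
            reply = call_model("What's the weather?")
            gen.end(output=reply)
    trace.update(output=reply)
```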
Sessions
Optionally, traces can be grouped into sessions. Sessions are used to group traces that are part of the same user interaction. A common example is a thread in a chat interface.
Please refer to the Sessions documentation to add sessions to your traces.
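To sketch the idea (same hypothetical client as above; in practice you would typically pass a shared session identifier when creating each trace):

```python
import abv  # hypothetical client; parameter names are illustrative

session_id = "chat-thread-123"  # one id for the whole conversation

# Each user turn is its own trace; the shared session_id groups the turns.
with abv.trace(name="turn-1", session_id=session_id,
               input="What's the weather?"):
    ...  # spans/generations for this turn
with abv.trace(name="turn-2", session_id=session_id,
               input="And tomorrow?"):
    ...  # spans/generations for this turn
```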
Scores
Scores are flexible objects used to evaluate traces, observations, sessions and dataset runs.
They can be:
- Numeric, categorical, or boolean values
- Associated with a trace, a session, or a dataset run (one and only one is required)
- For trace level scores only: Linked to a specific observation within a trace (optional)
- Annotated with comments for additional context
- Validated against a score configuration schema (optional)
Typically, session-level scores are used for comprehensive evaluation of conversational experiences across multiple interactions, while trace-level scores are used to evaluate a single interaction. Dataset-run-level scores capture the overall performance of a dataset run, e.g. precision, recall, or F1 score.
Please refer to the scores documentation to get started. For more details on score types and attributes, refer to the scores data model documentation.
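For instance, recording a numeric trace-level score might look like this (hypothetical `abv.score` call; the parameter names simply mirror the list above):

```python
import abv  # hypothetical client; `abv.score` is an illustrative name

abv.score(
    trace_id="trace-abc",      # exactly one target: trace, session, or dataset run
    observation_id="gen-xyz",  # optional, trace-level scores only
    name="answer-quality",
    value=0.92,                # numeric; categorical/boolean also supported
    comment="Concise and factually correct",  # optional annotation
    # config_id="answer-quality-v1",  # optional: validate against a score config
)
```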
Real-World Examples
Understanding how to apply ABV’s data model to different scenarios helps you instrument your application effectively.
Example 1: Simple Chatbot
Session: "User conversation thread #123"
└─ Trace: "User message: 'What's the weather?'"
└─ Generation: "OpenAI GPT-4 call"
└─ Trace: "User message: 'And tomorrow?'"
└─ Generation: "OpenAI GPT-4 call (with conversation history)"
Key takeaway: Each user message is a trace. The conversation is a session. The LLM call is a generation.
Example 2: RAG Pipeline
Trace: "User query: 'Explain quantum computing'"
├─ Span: "Retrieve relevant documents"
│ └─ Event: "Found 5 matching documents"
├─ Span: "Rerank documents"
│ └─ Event: "Selected top 3 documents"
└─ Generation: "Generate answer with context"
   └─ Score: "Answer quality = 0.92"
Key takeaway: Complex workflows use nested observations. Use spans for non-LLM operations, generations for LLM calls, and events for discrete actions.
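In code, this pipeline might be instrumented roughly as follows (same hypothetical client as earlier sketches; `span.event` and the `.id` attributes are assumed helpers, not the published API):

```python
import abv  # hypothetical client; method names are illustrative

with abv.trace(name="rag-query", input="Explain quantum computing") as trace:
    with trace.span(name="retrieve-documents") as retrieve:
        docs = ["doc1", "doc2", "doc3", "doc4", "doc5"]  # your retriever here
        retrieve.event(name="documents-found", metadata={"count": len(docs)})
    with trace.span(name="rerank-documents") as rerank:
        top_docs = docs[:3]                              # your reranker here
        rerank.event(name="documents-selected",
                     metadata={"count": len(top_docs)})
    with trace.generation(name="generate-answer", model="gpt-4",
                          input={"query": "Explain quantum computing",
                                 "context": top_docs}) as gen:
        answer = "Quantum computers use qubits ..."      # your LLM call here
        gen.end(output=answer)
    # Attach a score to the generation once you have an evaluation result.
    abv.score(trace_id=trace.id, observation_id=gen.id,
              name="answer-quality", value=0.92)
```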
Example 3: Multi-Agent System
Trace: "Research task: 'Find competitors'"
├─ Span: "Planner agent decides strategy"
│ └─ Generation: "GPT-4: Create research plan"
├─ Span: "Researcher agent executes"
│ ├─ Generation: "GPT-4: Search query 1"
│ ├─ Generation: "GPT-4: Search query 2"
│ └─ Event: "Found 10 results"
└─ Span: "Summarizer agent compiles results"
└─ Generation: "Claude: Summarize findings"
Key takeaway: Each agent’s work is a span. LLM calls within agents are generations. The entire task is one trace.
Now that you understand the data model, here’s how to implement it:
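The sketch below wires up the simple chatbot example end to end; the `abv` client and every method name remain illustrative assumptions, so consult the SDK reference for the real API:

```python
import abv  # hypothetical client; names are illustrative

def call_model(prompt: str) -> str:
    """Stand-in for your actual LLM call."""
    return "It looks sunny."

session_id = "thread-123"  # groups the whole conversation

for question in ["What's the weather?", "And tomorrow?"]:
    # One trace per user message, grouped into the session.
    with abv.trace(name="chat-message", session_id=session_id,
                   input=question) as trace:
        with trace.generation(name="gpt-4-call", model="gpt-4",
                              input=question) as gen:
            reply = call_model(question)
            gen.end(output=reply)
        trace.update(output=reply)
```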
Best practice: Start simple with basic traces, then add observations as you need more granularity. Add sessions when you have multi-turn interactions. Add scores when you’re ready to measure quality.