
Core Concepts at a Glance

ABV’s observability model has four main building blocks:
| Concept | What It Is | When to Use | Example |
| --- | --- | --- | --- |
| Trace | Single request or operation | Every API call or workflow | User asks a question in your chatbot |
| Observation | Individual steps within a trace | Track specific operations | LLM call, database query, function execution |
| Session | Group of related traces | Multi-turn interactions | Entire conversation thread |
| Score | Evaluation metric | Measure quality or performance | Accuracy score, cost, latency |

Data Model Visualization

The following diagram shows how ABV’s core concepts relate to each other.

Key relationships:
  • Sessions group multiple traces (one-to-many)
  • Traces contain multiple observations (one-to-many)
  • Observations can be nested hierarchically
  • Scores evaluate traces, observations, or sessions
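
These relationships can be pictured as plain data structures. The sketch below is a minimal illustration of the cardinalities using made-up field names; it is not ABV’s actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional, Union

@dataclass
class Observation:
    """An individual step within a trace; observations nest hierarchically."""
    name: str
    type: str  # "event", "span", or "generation"
    children: list["Observation"] = field(default_factory=list)

@dataclass
class Trace:
    """A single request or operation; contains many observations."""
    name: str
    session_id: Optional[str] = None  # same ID on many traces groups them into a session
    observations: list[Observation] = field(default_factory=list)

@dataclass
class Score:
    """Evaluates exactly one of: a trace, a session, or a dataset run."""
    name: str
    value: Union[float, str, bool]  # numeric, categorical, or boolean
    trace_id: Optional[str] = None
    session_id: Optional[str] = None
    dataset_run_id: Optional[str] = None
    observation_id: Optional[str] = None  # optional; trace-level scores only
    comment: Optional[str] = None
```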

Traces and Observations

Traces

A trace typically represents a single request or operation. It contains the overall input and output of the function, as well as metadata about the request (e.g., user, session, tags).
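
As a minimal sketch, creating a trace might look like the following. The abv client and every method and parameter name here are hypothetical, chosen only to illustrate the concept; consult the SDK reference for the real API:

```python
from abv import ABV  # hypothetical import, for illustration only

abv = ABV()

# One trace per request: overall input, output, and request metadata.
trace = abv.trace(
    name="answer-user-question",
    input={"question": "What's the weather?"},
    user_id="user-123",            # request metadata: user
    tags=["production", "chat"],   # request metadata: tags
)
trace.update(output={"answer": "Sunny, 22°C"})
```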

Observations

Each trace can contain multiple observations to log the individual steps of the execution. Usually, a trace corresponds to a single API call of an application.

Types of observations:
  • Events are the basic building blocks. They are used to track discrete events in a trace.
  • Spans represent durations of units of work in a trace.
  • Generations are spans used to log generations of AI models, including prompts, token usage, and costs.

Nesting: Observations can be nested to represent the hierarchical structure of your application. For example, a trace might contain a span for the entire request, which in turn contains a generation for an LLM call.
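
Continuing the hypothetical client from the sketch above (names are illustrative, not ABV’s documented API), nesting observations inside a trace could look like this:

```python
trace = abv.trace(name="handle-request", input={"question": "What's the weather?"})

# Span: a unit of work with a duration.
span = trace.span(name="fetch-weather-data")
span.event(name="cache-miss")  # Event: a discrete point in time within the span
span.end(output={"temp_c": 22})

# Generation: a span specialized for model calls (prompt, token usage, cost).
generation = trace.generation(
    name="compose-answer",
    model="gpt-4",
    input=[{"role": "user", "content": "What's the weather?"}],
)
generation.end(output="Sunny, 22°C", usage={"input_tokens": 12, "output_tokens": 8})
```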

Sessions

Optionally, traces can be grouped into sessions. Sessions are used to group traces that are part of the same user interaction. A common example is a thread in a chat interface. Please refer to the Sessions documentation to add sessions to your traces.
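
Grouping traces into a session is typically just a matter of sending the same session identifier with every trace, as in this sketch (same hypothetical client as above):

```python
session_id = "chat-thread-123"  # any stable ID for the conversation

# Every turn of the conversation becomes its own trace in the same session.
abv.trace(name="turn-1", session_id=session_id, input={"question": "What's the weather?"})
abv.trace(name="turn-2", session_id=session_id, input={"question": "And tomorrow?"})
```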

Scores

Scores are flexible objects used to evaluate traces, observations, sessions and dataset runs. They can be:
  • Numeric, categorical, or boolean values
  • Associated with a trace, a session, or a dataset run (one and only one is required)
  • For trace level scores only: Linked to a specific observation within a trace (optional)
  • Annotated with comments for additional context
  • Validated against a score configuration schema (optional)
Typically, session-level scores are used for comprehensive evaluation of conversational experiences across multiple interactions, while trace-level scores are used to evaluate a single interaction. Dataset-run-level scores are used for overall evaluation of a dataset run, e.g., precision, recall, or F1 score. Please refer to the scores documentation to get started. For more details on score types and attributes, refer to the scores data model documentation.
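
For example, attaching scores of different types might look like this sketch (hypothetical method and parameter names; note that exactly one target is set per score):

```python
# Numeric score on a trace, optionally linked to one of its observations.
abv.score(trace_id=trace.id, name="answer-quality", value=0.92,
          observation_id=generation.id, comment="Accurate and concise")

# Categorical score on a session (conversation-level evaluation).
abv.score(session_id="chat-thread-123", name="user-sentiment", value="positive")

# Boolean score validated against a score configuration schema.
abv.score(trace_id=trace.id, name="contains-pii", value=False, config_id="pii-check-v1")
```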

Real-World Examples

Understanding how to apply ABV’s data model to different scenarios helps you instrument your application effectively.

Example 1: Simple Chatbot

Session: "User conversation thread #123"
  ├─ Trace: "User message: 'What's the weather?'"
  │   └─ Generation: "OpenAI GPT-4 call"
  └─ Trace: "User message: 'And tomorrow?'"
      └─ Generation: "OpenAI GPT-4 call (with conversation history)"
Key takeaway: Each user message is a trace. The conversation is a session. The LLM call is a generation.
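
Put together, instrumenting this chatbot might look like the sketch below, reusing the hypothetical client from the earlier sections:

```python
session_id = "thread-123"

# Turn 1: one trace per user message, one generation per LLM call.
t1 = abv.trace(name="user-message", session_id=session_id,
               input={"message": "What's the weather?"})
g1 = t1.generation(name="openai-gpt4-call", model="gpt-4",
                   input=[{"role": "user", "content": "What's the weather?"}])
g1.end(output="Sunny, 22°C")
t1.update(output={"reply": "Sunny, 22°C"})

# Turn 2: a new trace in the same session, now with conversation history.
t2 = abv.trace(name="user-message", session_id=session_id,
               input={"message": "And tomorrow?", "history": ["What's the weather?"]})
```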

Example 2: RAG Pipeline

Trace: "User query: 'Explain quantum computing'"
  ├─ Span: "Retrieve relevant documents"
  │   └─ Event: "Found 5 matching documents"
  ├─ Span: "Rerank documents"
  │   └─ Event: "Selected top 3 documents"
  └─ Generation: "Generate answer with context"
      └─ Score: "Answer quality = 0.92"
Key takeaway: Complex workflows use nested observations. Use spans for non-LLM operations, generations for LLM calls, and events for discrete actions.
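
A sketch of this pipeline with the same hypothetical client, showing spans for the retrieval steps, events for their discrete results, a generation for the LLM call, and a score on the trace:

```python
trace = abv.trace(name="rag-query", input={"query": "Explain quantum computing"})

retrieve = trace.span(name="retrieve-documents")
retrieve.event(name="found-documents", metadata={"count": 5})
retrieve.end()

rerank = trace.span(name="rerank-documents")
rerank.event(name="selected-top-k", metadata={"k": 3})
rerank.end()

answer = trace.generation(name="generate-answer", model="gpt-4",
                          input={"query": "Explain quantum computing",
                                 "context": ["doc-1", "doc-2", "doc-3"]})
answer.end(output="Quantum computing uses qubits ...")

abv.score(trace_id=trace.id, name="answer-quality", value=0.92,
          observation_id=answer.id)
```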

Example 3: Multi-Agent System

Trace: "Research task: 'Find competitors'"
  ├─ Span: "Planner agent decides strategy"
  │   └─ Generation: "GPT-4: Create research plan"
  ├─ Span: "Researcher agent executes"
  │   ├─ Generation: "GPT-4: Search query 1"
  │   ├─ Generation: "GPT-4: Search query 2"
  │   └─ Event: "Found 10 results"
  └─ Span: "Summarizer agent compiles results"
      └─ Generation: "Claude: Summarize findings"
Key takeaway: Each agent’s work is a span. LLM calls within agents are generations. The entire task is one trace.
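
The same pattern scales to agents, as in this sketch (hypothetical client as before): each agent gets a span, and its model calls become nested generations:

```python
trace = abv.trace(name="research-task", input={"task": "Find competitors"})

planner = trace.span(name="planner-agent")
planner.generation(name="create-research-plan", model="gpt-4").end(output="1. Search ...")
planner.end()

researcher = trace.span(name="researcher-agent")
researcher.generation(name="search-query-1", model="gpt-4").end(output="...")
researcher.generation(name="search-query-2", model="gpt-4").end(output="...")
researcher.event(name="found-results", metadata={"count": 10})
researcher.end()

summarizer = trace.span(name="summarizer-agent")
summarizer.generation(name="summarize-findings", model="claude").end(output="Top competitors: ...")
summarizer.end()
```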
Now that you understand the data model, here’s how to put it into practice:

Best practice: Start simple with basic traces, then add observations as you need more granularity. Add sessions when you have multi-turn interactions. Add scores when you’re ready to measure quality.