
Core Concepts at a Glance

ABV’s observability model has four main building blocks:
| Concept | What It Is | When to Use | Example |
| --- | --- | --- | --- |
| Trace | Single request or operation | Every API call or workflow | User asks a question in your chatbot |
| Observation | Individual steps within a trace | Track specific operations | LLM call, database query, function execution |
| Session | Group of related traces | Multi-turn interactions | Entire conversation thread |
| Score | Evaluation metric | Measure quality or performance | Accuracy score, cost, latency |

Data Model Visualization

The following diagram shows how ABV’s core concepts relate to each other.

Key relationships:
  • Sessions group multiple traces (one-to-many)
  • Traces contain multiple observations (one-to-many)
  • Observations can be nested hierarchically
  • Scores evaluate traces, observations, or sessions
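
These relationships can be pictured as plain data structures. The sketch below is a minimal illustration of the cardinalities using made-up field names; it is not ABV’s actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional, Union

@dataclass
class Observation:
    """An individual step within a trace; observations nest hierarchically."""
    name: str
    type: str  # "event", "span", or "generation"
    children: list["Observation"] = field(default_factory=list)

@dataclass
class Trace:
    """A single request or operation; contains many observations."""
    name: str
    session_id: Optional[str] = None  # same ID on many traces groups them into a session
    observations: list[Observation] = field(default_factory=list)

@dataclass
class Score:
    """Evaluates exactly one of: a trace, a session, or a dataset run."""
    name: str
    value: Union[float, str, bool]  # numeric, categorical, or boolean
    trace_id: Optional[str] = None
    session_id: Optional[str] = None
    dataset_run_id: Optional[str] = None
    observation_id: Optional[str] = None  # optional; trace-level scores only
    comment: Optional[str] = None
```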

Traces and Observations

Traces

A trace typically represents a single request or operation. It contains the overall input and output of the function, as well as metadata about the request (e.g., user, session, tags).
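
As a minimal sketch, creating a trace might look like the following. The abv client and every method and parameter name here are hypothetical, chosen only to illustrate the concept; consult the SDK reference for the real API:

```python
from abv import ABV  # hypothetical import, for illustration only

abv = ABV()

# One trace per request: overall input, output, and request metadata.
trace = abv.trace(
    name="answer-user-question",
    input={"question": "What's the weather?"},
    user_id="user-123",            # request metadata: user
    tags=["production", "chat"],   # request metadata: tags
)
trace.update(output={"answer": "Sunny, 22°C"})
```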

Observations

Each trace can contain multiple observations to log the individual steps of the execution. Usually, a trace corresponds to a single API call of an application.

Types of observations:
  • Events are the basic building blocks. They are used to track discrete events in a trace.
  • Spans represent durations of units of work in a trace.
  • Generations are spans used to log generations of AI models, including prompts, token usage, and costs.

Nesting: Observations can be nested to represent the hierarchical structure of your application. For example, a trace might contain a span for the entire request, which in turn contains a generation for an LLM call.
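
Continuing the hypothetical client from the sketch above (names are illustrative, not ABV’s documented API), nesting observations inside a trace could look like this:

```python
trace = abv.trace(name="handle-request", input={"question": "What's the weather?"})

# Span: a unit of work with a duration.
span = trace.span(name="fetch-weather-data")
span.event(name="cache-miss")  # Event: a discrete point in time within the span
span.end(output={"temp_c": 22})

# Generation: a span specialized for model calls (prompt, token usage, cost).
generation = trace.generation(
    name="compose-answer",
    model="gpt-4",
    input=[{"role": "user", "content": "What's the weather?"}],
)
generation.end(output="Sunny, 22°C", usage={"input_tokens": 12, "output_tokens": 8})
```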

Sessions

Optionally, traces can be grouped into sessions. Sessions are used to group traces that are part of the same user interaction. A common example is a thread in a chat interface. Please refer to the Sessions documentation to add sessions to your traces.
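
Grouping traces into a session is typically just a matter of sending the same session identifier with every trace, as in this sketch (same hypothetical client as above):

```python
session_id = "chat-thread-123"  # any stable ID for the conversation

# Every turn of the conversation becomes its own trace in the same session.
abv.trace(name="turn-1", session_id=session_id, input={"question": "What's the weather?"})
abv.trace(name="turn-2", session_id=session_id, input={"question": "And tomorrow?"})
```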

Scores

Scores are flexible objects used to evaluate traces, observations, sessions and dataset runs. They can be:
  • Numeric, categorical, or boolean values
  • Associated with a trace, a session, or a dataset run (one and only one is required)
  • For trace level scores only: Linked to a specific observation within a trace (optional)
  • Annotated with comments for additional context
  • Validated against a score configuration schema (optional)
Typically, session-level scores are used for comprehensive evaluation of conversational experiences across multiple interactions, while trace-level scores are used to evaluate a single interaction. Dataset-run-level scores are used for overall evaluation of a dataset run, e.g., precision, recall, or F1 score. Please refer to the scores documentation to get started. For more details on score types and attributes, refer to the scores data model documentation.
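
For example, attaching scores of different types might look like this sketch (hypothetical method and parameter names; note that exactly one target is set per score):

```python
# Numeric score on a trace, optionally linked to one of its observations.
abv.score(trace_id=trace.id, name="answer-quality", value=0.92,
          observation_id=generation.id, comment="Accurate and concise")

# Categorical score on a session (conversation-level evaluation).
abv.score(session_id="chat-thread-123", name="user-sentiment", value="positive")

# Boolean score validated against a score configuration schema.
abv.score(trace_id=trace.id, name="contains-pii", value=False, config_id="pii-check-v1")
```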

Real-World Examples

Understanding how to apply ABV’s data model to different scenarios helps you instrument your application effectively.

Example 1: Simple Chatbot

Session: "User conversation thread #123"
  ├─ Trace: "User message: 'What's the weather?'"
  │   └─ Generation: "OpenAI GPT-4 call"
  └─ Trace: "User message: 'And tomorrow?'"
      └─ Generation: "OpenAI GPT-4 call (with conversation history)"
Key takeaway: Each user message is a trace. The conversation is a session. The LLM call is a generation.
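
Put together, instrumenting this chatbot might look like the sketch below, reusing the hypothetical client from the earlier sections:

```python
session_id = "thread-123"

# Turn 1: one trace per user message, one generation per LLM call.
t1 = abv.trace(name="user-message", session_id=session_id,
               input={"message": "What's the weather?"})
g1 = t1.generation(name="openai-gpt4-call", model="gpt-4",
                   input=[{"role": "user", "content": "What's the weather?"}])
g1.end(output="Sunny, 22°C")
t1.update(output={"reply": "Sunny, 22°C"})

# Turn 2: a new trace in the same session, now with conversation history.
t2 = abv.trace(name="user-message", session_id=session_id,
               input={"message": "And tomorrow?", "history": ["What's the weather?"]})
```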

Example 2: RAG Pipeline

Trace: "User query: 'Explain quantum computing'"
  ├─ Span: "Retrieve relevant documents"
  │   └─ Event: "Found 5 matching documents"
  ├─ Span: "Rerank documents"
  │   └─ Event: "Selected top 3 documents"
  └─ Generation: "Generate answer with context"
      └─ Score: "Answer quality = 0.92"
Key takeaway: Complex workflows use nested observations. Use spans for non-LLM operations, generations for LLM calls, and events for discrete actions.
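
A sketch of this pipeline with the same hypothetical client, showing spans for the retrieval steps, events for their discrete results, a generation for the LLM call, and a score on the trace:

```python
trace = abv.trace(name="rag-query", input={"query": "Explain quantum computing"})

retrieve = trace.span(name="retrieve-documents")
retrieve.event(name="found-documents", metadata={"count": 5})
retrieve.end()

rerank = trace.span(name="rerank-documents")
rerank.event(name="selected-top-k", metadata={"k": 3})
rerank.end()

answer = trace.generation(name="generate-answer", model="gpt-4",
                          input={"query": "Explain quantum computing",
                                 "context": ["doc-1", "doc-2", "doc-3"]})
answer.end(output="Quantum computing uses qubits ...")

abv.score(trace_id=trace.id, name="answer-quality", value=0.92,
          observation_id=answer.id)
```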

Example 3: Multi-Agent System

Trace: "Research task: 'Find competitors'"
  ├─ Span: "Planner agent decides strategy"
  │   └─ Generation: "GPT-4: Create research plan"
  ├─ Span: "Researcher agent executes"
  │   ├─ Generation: "GPT-4: Search query 1"
  │   ├─ Generation: "GPT-4: Search query 2"
  │   └─ Event: "Found 10 results"
  └─ Span: "Summarizer agent compiles results"
      └─ Generation: "Claude: Summarize findings"
Key takeaway: Each agent’s work is a span. LLM calls within agents are generations. The entire task is one trace.
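
The same pattern scales to agents, as in this sketch (hypothetical client as before): each agent gets a span, and its model calls become nested generations:

```python
trace = abv.trace(name="research-task", input={"task": "Find competitors"})

planner = trace.span(name="planner-agent")
planner.generation(name="create-research-plan", model="gpt-4").end(output="1. Search ...")
planner.end()

researcher = trace.span(name="researcher-agent")
researcher.generation(name="search-query-1", model="gpt-4").end(output="...")
researcher.generation(name="search-query-2", model="gpt-4").end(output="...")
researcher.event(name="found-results", metadata={"count": 10})
researcher.end()

summarizer = trace.span(name="summarizer-agent")
summarizer.generation(name="summarize-findings", model="claude").end(output="Top competitors: ...")
summarizer.end()
```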
Now that you understand the data model, here’s how to put it into practice:

Best practice: Start simple with basic traces, then add observations as you need more granularity. Add sessions when you have multi-turn interactions. Add scores when you’re ready to measure quality.