How do I create and manage Score Configs?
Score Configs ensure scores follow a specific schema and standardize scoring across your team.

Create a Score Config:
- Navigate to your project in the ABV UI
- Go to Evaluations → Score Configs
- Click Create Score Config
- Configure:
  - Name: e.g., `user_feedback`, `hallucination_eval`
  - Data Type: `NUMERIC`, `CATEGORICAL`, or `BOOLEAN`
  - Constraints: Min/Max for numeric, custom categories for categorical
Manage Configs:

- Configs are immutable but can be archived
- Archived configs can be restored anytime
- Link scores to configs using `config_id` to ensure schema compliance
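For illustration, the fields you set in the UI correspond to config definitions along these lines. This is a minimal sketch; the field names are assumptions, not a confirmed schema:

```python
# Illustrative only: Score Config fields expressed as plain dicts.
# Field names are assumptions based on the steps above.
categorical_config = {
    "name": "user_feedback",
    "data_type": "CATEGORICAL",
    "categories": ["positive", "neutral", "negative"],  # custom categories
}

numeric_config = {
    "name": "hallucination_eval",
    "data_type": "NUMERIC",
    "min_value": 0.0,  # lowest accepted score
    "max_value": 1.0,  # highest accepted score
}
```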
I don't see traces in the dashboard. How do I troubleshoot?
Common causes and solutions:
- Events not flushed (short-lived apps):
  - Python: Call `abv.flush()` before exit
  - JS/TS: Call `await abvSpanProcessor.forceFlush()` before exit
- Incorrect API credentials:
  - Verify your API key is correct
  - Check region (US: https://app.abv.dev, EU: https://eu.app.abv.dev)
  - Python: Use `abv.auth_check()` to verify credentials
- Instrumentation not loaded:
  - JS/TS: Ensure `import "./instrumentation"` is the FIRST import
  - Python: Initialize with `get_client()` or `ABV()`
- Network/firewall issues:
  - Verify your application can reach the ABV API
  - Check for proxy/firewall blocking requests
- Sampling too aggressive:
  - Check if sampling is filtering out traces
  - Temporarily set sample rate to 1.0 (100%) to test
- Wrong project:
  - Verify you’re viewing the correct project in the ABV UI
  - Confirm API key belongs to the project you’re viewing
- For JS/TS with @vercel/otel:
  - Use manual OpenTelemetry setup via `NodeTracerProvider`
  - The @vercel/otel package doesn’t support OpenTelemetry JS SDK v2
- Enable debug logging:
  - Python: Set log level in code
  - JS/TS: Set `ABV_LOG_LEVEL="DEBUG"` in environment variables
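Putting the Python-side checks together, a minimal triage sketch. The import path and logger setup are assumptions; `get_client()`, `auth_check()`, and `flush()` are the calls named above:

```python
# Quick triage script for missing traces.
import logging

from abv import get_client  # assumed import path for the Python SDK

# Enable debug logging so the SDK reports export attempts and failures.
logging.basicConfig(level=logging.DEBUG)

abv = get_client()

# 1. Confirm credentials and region (US vs EU host) before anything else.
if not abv.auth_check():
    raise RuntimeError("ABV auth failed: check API key and base URL/region")

# ... run your instrumented application code here ...

# 2. Flush buffered events before exit (critical for short-lived scripts,
#    CLIs, and serverless functions).
abv.flush()
```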
How do I capture user feedback for evaluation?
Capture user feedback as scores to evaluate LLM application quality.

Method 1: Frontend Collection (Browser SDK)

Method 2: Backend Collection (Python SDK) (see the sketch after the list below)

Method 3: Human Annotation UI

Use Annotation Queues for structured team reviews:
- Create Score Configs for feedback dimensions
- Create an Annotation Queue
- Assign team members to review traces
- Annotate traces directly in the ABV UI
- Link scores to Score Configs for consistent schema
- Use `trace_id` to associate feedback with specific interactions
- Scores can be ingested before the trace is created (linked automatically)
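A minimal sketch of Method 2, assuming a Python client exposing `get_client()` and a `create_score` method (the method name and fields are illustrative, not confirmed SDK API):

```python
# Hedged sketch: backend collection of user feedback as a score.
from abv import get_client

client = get_client()

def record_user_feedback(trace_id, thumbs_up, comment=None):
    """Attach end-user feedback to the trace that produced the response."""
    client.create_score(
        trace_id=trace_id,            # associates feedback with the interaction
        name="user_feedback",
        value=1 if thumbs_up else 0,  # 1 = positive, 0 = negative
        data_type="BOOLEAN",
        comment=comment,
    )

# Example: called from your feedback endpoint after the user clicks thumbs up.
record_user_feedback(trace_id="trace-id-from-frontend", thumbs_up=True)
```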
How do Score Configs ensure data consistency?
Score Configs enforce schema validation across your evaluation workflows.

Benefits:
- Standardized scoring: All team members use the same criteria
- Data validation: Automatic validation of score values
- Type safety: Ensures numeric/categorical/boolean consistency
- Schema evolution: Archive old configs, create new versions
When you create a score with a `config_id`, ABV validates that `string_value` matches one of the categories defined by that config.

Example: Numeric Score Config with Constraints

Scores outside the 0-1 range will be rejected. See Scores Data Model for configuration options.
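A hedged sketch of the numeric example above; the `create_score` method and config ID are assumptions for illustration:

```python
# Scoring against a numeric Score Config constrained to the 0-1 range.
from abv import get_client

client = get_client()

# Accepted: 0.8 falls inside the config's 0-1 range.
client.create_score(
    trace_id="your-trace-id",
    name="answer_quality",
    value=0.8,
    config_id="numeric-config-id",  # hypothetical Score Config ID
)

# Rejected at ingestion: 1.5 violates the config's max constraint.
client.create_score(
    trace_id="your-trace-id",
    name="answer_quality",
    value=1.5,
    config_id="numeric-config-id",
)
```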
Can I use scores without Score Configs?
Yes, Score Configs are optional but recommended.

Without Score Configs:

- Manually specify `data_type` for each score
- No automatic validation of value ranges
- Less consistency across team members

With Score Configs:

- Reference `config_id` to automatically set `data_type`
- Automatic value validation
- Standardized across all scores with that name

Recommendation: Use Score Configs for production evaluation workflows.
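A hedged side-by-side of the two approaches, using the same assumed `create_score` call as the earlier sketches:

```python
from abv import get_client

client = get_client()

# Without a Score Config: set data_type manually, no range validation.
client.create_score(
    trace_id="your-trace-id",
    name="relevance",
    value=0.7,
    data_type="NUMERIC",
)

# With a Score Config: reference config_id; data_type is inferred and the
# value is validated against the config's constraints.
client.create_score(
    trace_id="your-trace-id",
    name="relevance",
    value=0.7,
    config_id="relevance-config-id",  # hypothetical Score Config ID
)
```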
How do I link scores to traces, observations, or sessions?
Scores can be attached to different levels of your application data:

- Trace-level (most common)
- Observation-level (specific LLM call)
- Session-level (multi-turn conversation)
- Dataset Run-level (experiment performance)

Note: Each score references exactly one of these objects. See Scores Data Model for use cases.
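A hedged sketch of the four levels; the `create_score` method and ID parameter names are assumptions, and each call references exactly one object, per the note above:

```python
from abv import get_client

client = get_client()

# Trace-level (most common)
client.create_score(trace_id="trace-id", name="user_feedback", value=1)

# Observation-level (a specific LLM call)
client.create_score(observation_id="observation-id", name="hallucination_eval", value=0.1)

# Session-level (multi-turn conversation)
client.create_score(session_id="session-id", name="conversation_quality", value=0.8)

# Dataset Run-level (experiment performance)
client.create_score(dataset_run_id="run-id", name="exact_match", value=0.92)
```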
What's the difference between API, EVAL, and ANNOTATION scores?
The `source` field automatically categorizes how scores were created:

| Source | Description | Example Use Case |
|---|---|---|
| API | Scores created via SDK or API | User feedback, runtime metrics, custom evaluations |
| EVAL | Scores from LLM-as-a-Judge evaluations | Automated quality checks, hallucination detection |
| ANNOTATION | Scores from Human Annotation UI | Manual reviews, annotation queues, team collaboration |
Automatic Assignment:

- SDK/API calls → `source="API"`
- LLM-as-a-Judge runs → `source="EVAL"`
- UI annotations → `source="ANNOTATION"`
- View scores by source in the ABV UI
- Query via API: `abv.get_scores(source="EVAL")`
- Useful for comparing human vs automated evaluations
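For example, to compare human and automated evaluations with the query call above (a sketch: the import path and the `.value` attribute on returned scores are assumptions):

```python
from abv import get_client  # assumed import path

abv = get_client()

def average(scores):
    """Mean of numeric score values, ignoring scores without a value."""
    values = [s.value for s in scores if s.value is not None]
    return sum(values) / len(values) if values else None

# get_scores(source=...) is the call referenced above.
print("LLM-as-a-Judge avg:", average(abv.get_scores(source="EVAL")))
print("Human annotation avg:", average(abv.get_scores(source="ANNOTATION")))
```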