This tutorial walks you through a complete evaluation workflow with a practical example. You’ll create a simple dataset, run evaluations, and view results, giving you hands-on experience before diving into advanced concepts.
Time: 20-30 minutes
Prerequisites:
  • ABV SDK installed (Python or JS/TS)
  • API key configured
  • Basic understanding of traces

The Example: Email Response Quality

We’ll build an evaluation system for an AI that generates customer service email responses. Our goal is to measure:
  • Politeness: Is the tone professional and courteous?
  • Completeness: Does it address all customer questions?
  • Accuracy: Is the information correct?
This is a common real-world use case that demonstrates key evaluation concepts.

Step 1: Create a Simple Dataset

First, let’s create a small dataset with 3 test cases. Each case has an input (customer email) and expected qualities.
from abvdev import ABV

abv = ABV()

# Create a dataset for email response evaluation
dataset = abv.datasets.create(
    name="email-responses-v1",
    description="Test cases for customer service email quality"
)

# Add test cases
test_cases = [
    {
        "input": {
            "customer_email": "My order #12345 arrived damaged. I need a refund.",
            "context": "Product: Coffee Maker, Price: $89.99"
        },
        "expected_output": {
            "polite": True,
            "addresses_refund": True,
            "mentions_order_number": True
        }
    },
    {
        "input": {
            "customer_email": "How do I reset my password? I've tried 3 times.",
            "context": "Customer tier: Premium"
        },
        "expected_output": {
            "polite": True,
            "provides_steps": True,
            "offers_additional_help": True
        }
    },
    {
        "input": {
            "customer_email": "Your website is terrible! Nothing works!",
            "context": "Customer tier: Free"
        },
        "expected_output": {
            "polite": True,  # Must stay professional even when customer isn't
            "acknowledges_frustration": True,
            "offers_specific_help": True
        }
    }
]

# Add each test case to the dataset
for i, case in enumerate(test_cases):
    dataset.create_item(
        input=case["input"],
        expected_output=case["expected_output"],
        metadata={"case_number": i + 1}
    )

print(f"βœ… Created dataset with {len(test_cases)} test cases")
print(f"Dataset ID: {dataset.id}")
What we did:
  • Created a dataset named email-responses-v1
  • Added 3 test cases covering different scenarios (damaged product, password help, frustrated customer)
  • Each case has input data and expected qualities we want to verify
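If you want to confirm the items landed as expected before moving on, here is a quick sanity check. It is a sketch that uses only the dataset.items, item.metadata, and item.input attributes already shown in this tutorial:
# Sketch: list what was just created
for item in dataset.items:
    case_number = item.metadata["case_number"]
    preview = item.input["customer_email"][:40]
    print(f"Case {case_number}: {preview}...")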

Step 2: Run Your First Evaluation

Now let’s run our AI model on each test case and evaluate the results.
# Your AI function that generates email responses
def generate_email_response(customer_email: str, context: str) -> str:
    """
    This is your actual AI function. For this tutorial, we'll use a simple mock.
    In production, this would call your LLM (OpenAI, Anthropic, etc.)
    """
    # Mock response for demonstration
    # Replace this with your actual LLM call
    return f"Dear valued customer, thank you for contacting us about: {customer_email[:50]}..."

# Run the dataset through your AI
dataset_run = dataset.run(
    name="Initial evaluation",
    description="Testing baseline email response quality",
    metadata={"model": "gpt-4", "temperature": 0.7}
)

# Process each test case
for item in dataset.items:
    # Generate AI response
    input_data = item.input
    ai_response = generate_email_response(
        customer_email=input_data["customer_email"],
        context=input_data["context"]
    )

    # Record the output
    dataset_run.create_observation(
        dataset_item_id=item.id,
        output=ai_response,
        metadata={"input_case": item.metadata["case_number"]}
    )

print(f"βœ… Completed evaluation run: {dataset_run.id}")
print("View results in the ABV platform!")
What we did:
  • Created a dataset run to track this specific evaluation
  • Generated AI responses for each test case
  • Recorded outputs so we can score them
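If you want to swap the mock for a real model, here is a minimal sketch using the OpenAI Python SDK. This is an assumption, not part of the ABV SDK; any LLM client works, and the model name is illustrative:
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def generate_email_response(customer_email: str, context: str) -> str:
    """Generate a customer service reply with a real LLM call (illustrative)."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system", "content": "You are a polite, thorough customer service agent."},
            {"role": "user", "content": f"Customer email: {customer_email}\nContext: {context}"},
        ],
    )
    return completion.choices[0].message.content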

Step 3: Score the Results

Now we’ll add scores to evaluate quality. We’ll use both automated and manual scoring.

Option A: LLM-as-a-Judge (Automated)

Use another LLM to evaluate the responses:
# Score politeness using LLM-as-a-Judge
def score_politeness_llm(response: str) -> dict:
    """
    Use an LLM to judge if the response is polite
    """
    # In production, call your LLM API
    # For demo, we'll return a mock score
    score_value = 0.85  # 0-1 scale

    return {
        "value": score_value,
        "comment": "Response uses professional language and courteous tone"
    }

# Apply scores to each observation
for observation in dataset_run.observations:
    # Get politeness score
    politeness = score_politeness_llm(observation.output)

    # Record the score
    observation.score(
        name="politeness",
        value=politeness["value"],
        comment=politeness["comment"]
    )

print("βœ… Scored all responses for politeness")

Option B: Custom Scoring (Rule-Based)

Create simple rules to check specific criteria:
def score_completeness(response: str, expected: dict) -> float:
    """
    Check whether the response addresses the expected points.
    Simple keyword matching for demonstration; expected criteria without
    a matching rule below count as unmet, which lowers the score.
    """
    score = 0
    checks = 0

    # Check each criterion that is expected to be present
    for key, should_have in expected.items():
        if not should_have:
            continue  # skip criteria that aren't expected in this case
        checks += 1
        if key == "addresses_refund" and "refund" in response.lower():
            score += 1
        elif key == "mentions_order_number" and "#" in response:
            score += 1
        elif key == "provides_steps" and any(word in response.lower() for word in ["step", "first", "then"]):
            score += 1

    return score / checks if checks > 0 else 0

# Apply custom scores
for observation in dataset_run.observations:
    # Get the expected output for this case
    item = dataset.get_item(observation.dataset_item_id)
    expected = item.expected_output

    # Calculate completeness score
    completeness = score_completeness(observation.output, expected)

    # Record the score
    observation.score(
        name="completeness",
        value=completeness,
        comment=f"Addressed {int(completeness * 100)}% of expected points"
    )

print("βœ… Scored all responses for completeness")

Step 4: View Results in the Platform

Now go to the ABV platform to see your results:
  1. Navigate to Evaluations → Datasets
  2. Find your dataset: email-responses-v1
  3. Click on the run: "Initial evaluation"
  4. Review scores for each test case
You’ll see:
  • Overall score statistics (average, min, max)
  • Individual test case results
  • Score breakdown by dimension (politeness, completeness)
  • Comments explaining each score
What to look for:
  • Which test cases scored lowest? Those need attention.
  • Are scores consistent across dimensions?
  • Do comments explain the reasoning clearly?

Step 5: Iterate and Improve

Based on the results, let’s improve our AI and re-evaluate:
# Improved version with better prompting
def generate_email_response_v2(customer_email: str, context: str) -> str:
    """
    Version 2: Improved prompt engineering
    """
    # In production, use a better prompt like:
    prompt = f"""Generate a professional customer service email response.

Customer email: {customer_email}
Context: {context}

Requirements:
- Be polite and professional
- Address all customer concerns
- Offer specific next steps
- Show empathy when appropriate

Response:"""

    # Mock improved response; in production, send the prompt above to your LLM and return its reply
    return "Dear Customer, I sincerely apologize for the issue with your order..."

# Run evaluation again with v2
dataset_run_v2 = dataset.run(
    name="Improved model (v2)",
    description="Testing with better prompting",
    metadata={"model": "gpt-4", "temperature": 0.7, "version": "v2"}
)

for item in dataset.items:
    input_data = item.input
    ai_response = generate_email_response_v2(
        customer_email=input_data["customer_email"],
        context=input_data["context"]
    )

    dataset_run_v2.create_observation(
        dataset_item_id=item.id,
        output=ai_response
    )

# Score the new version
# ... (same scoring code as before)

print("βœ… Completed v2 evaluation - compare results in the platform!")
Now you can compare the two runs side-by-side in the platform to see improvements!
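If you also want a quick local comparison before opening the platform, you can re-score both generators against the same test cases in plain Python. This sketch assumes the test_cases list and the functions defined in the earlier steps are still in scope:
from statistics import mean

# Re-run the rule-based scorer over both mock generators, locally
for label, generate in [("v1", generate_email_response), ("v2", generate_email_response_v2)]:
    scores = [
        score_completeness(
            generate(case["input"]["customer_email"], case["input"]["context"]),
            case["expected_output"],
        )
        for case in test_cases
    ]
    print(f"{label}: mean completeness = {mean(scores):.2f}")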

What You’ve Learned

Congratulations! You’ve completed a full evaluation workflow:
✅ Created a dataset with realistic test cases
✅ Ran evaluations on your AI outputs
✅ Applied scores using both LLM-as-a-Judge and custom rules
✅ Viewed results in the ABV platform
✅ Iterated by running a second evaluation with improvements

Expand Your Dataset

  • Add 10-20 more test cases covering edge cases
  • Include examples from production errors
  • Balance positive and negative examples
Read: Datasets Best Practices

Also explore: Advanced Scoring, Automation, and Integration.

Common Questions

How many test cases do I need?
Start with 10-20 high-quality test cases covering key scenarios. Add more as you find gaps. Quality > quantity; focus on representative, real-world examples.
Should I use custom scoring or LLM-as-a-Judge?
Use both! Custom scoring is fast and deterministic for clear criteria (e.g., "contains word X"). LLM-as-a-Judge is better for nuanced qualities like tone or helpfulness.
How often should I run evaluations?
  • During development: Every time you change prompts or models
  • In production: Daily or weekly automated runs (a minimal score-gate sketch follows this list)
  • For experiments: Before and after each change to measure impact
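For automated runs, a common pattern is to gate a CI pipeline on a minimum average score. A minimal sketch in plain Python; the threshold and how you collect run_scores are up to you:
import sys
from statistics import mean

run_scores = [0.85, 0.72, 0.91]  # placeholder: collect these from your scoring loop
THRESHOLD = 0.75  # illustrative cutoff

if mean(run_scores) < THRESHOLD:
    print(f"Mean score {mean(run_scores):.2f} is below the {THRESHOLD} threshold")
    sys.exit(1)
print("Evaluation gate passed")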
Can I evaluate production traffic without creating a dataset?
Yes! You can score production traces directly without creating datasets. This is called "online evaluation." See Evaluations Overview for details.

Troubleshooting

Dataset not showing up?
  • Wait a few seconds for the data to sync
  • Check your API key permissions
  • Verify you’re looking in the correct project
Scores not calculating?
  • Ensure observations are created before scoring
  • Check that score values are between 0 and 1 (a small clamp helper is sketched after this list)
  • Verify your scoring function returns the correct format
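If out-of-range values are the culprit, a tiny helper (plain Python, nothing SDK-specific) can normalize scores before you record them:
def as_unit_interval(value: float) -> float:
    """Clamp a raw score into the 0-1 range described above."""
    return max(0.0, min(1.0, float(value)))

# Usage: observation.score(name="politeness", value=as_unit_interval(raw_value), ...)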
Can’t compare runs?
  • Both runs must be on the same dataset
  • Scores must have the same names across runs
  • Allow time for score aggregation to complete
For more help, see Evaluations Troubleshooting FAQ

Complete Example Code

Want the full working example? Here’s everything in one place:
from abvdev import ABV

abv = ABV()

# 1. Create dataset
dataset = abv.datasets.create(
    name="email-responses-tutorial",
    description="Quickstart tutorial dataset"
)

# 2. Add test cases
test_cases = [
    {
        "input": {"customer_email": "Order damaged, need refund", "context": "Product: Coffee Maker"},
        "expected_output": {"polite": True, "addresses_refund": True}
    },
    {
        "input": {"customer_email": "Password reset help", "context": "Premium customer"},
        "expected_output": {"polite": True, "provides_steps": True}
    }
]

for case in test_cases:
    dataset.create_item(input=case["input"], expected_output=case["expected_output"])

# 3. Run evaluation
dataset_run = dataset.run(name="Tutorial run")
for item in dataset.items:
    response = f"Mock response to: {item.input['customer_email']}"
    dataset_run.create_observation(dataset_item_id=item.id, output=response)

# 4. Add scores
for observation in dataset_run.observations:
    observation.score(name="quality", value=0.8, comment="Good response")

print(f"βœ… Complete! View results for dataset run: {dataset_run.id}")

You’re now ready to build comprehensive evaluation systems for your LLM applications! 🎉