This tutorial walks you through a complete evaluation workflow with a practical example. You’ll create a simple dataset, run evaluations, and view results, giving you hands-on experience before diving into advanced concepts.
Time: 20-30 minutes
Prerequisites:
  • ABV SDK installed (Python or JS/TS)
  • API key configured
  • Basic understanding of traces

The Example: Email Response Quality

We’ll build an evaluation system for an AI that generates customer service email responses. Our goal is to measure:
  • Politeness: Is the tone professional and courteous?
  • Completeness: Does it address all customer questions?
  • Accuracy: Is the information correct?
This is a common real-world use case that demonstrates key evaluation concepts.

Step 1: Create a Simple Dataset

First, let’s create a small dataset with 3 test cases. Each case has an input (customer email) and expected qualities.
from abvdev import ABV

abv = ABV()

# Create a dataset for email response evaluation
dataset = abv.datasets.create(
    name="email-responses-v1",
    description="Test cases for customer service email quality"
)

# Add test cases
test_cases = [
    {
        "input": {
            "customer_email": "My order #12345 arrived damaged. I need a refund.",
            "context": "Product: Coffee Maker, Price: $89.99"
        },
        "expected_output": {
            "polite": True,
            "addresses_refund": True,
            "mentions_order_number": True
        }
    },
    {
        "input": {
            "customer_email": "How do I reset my password? I've tried 3 times.",
            "context": "Customer tier: Premium"
        },
        "expected_output": {
            "polite": True,
            "provides_steps": True,
            "offers_additional_help": True
        }
    },
    {
        "input": {
            "customer_email": "Your website is terrible! Nothing works!",
            "context": "Customer tier: Free"
        },
        "expected_output": {
            "polite": True,  # Must stay professional even when customer isn't
            "acknowledges_frustration": True,
            "offers_specific_help": True
        }
    }
]

# Add each test case to the dataset
for i, case in enumerate(test_cases):
    dataset.create_item(
        input=case["input"],
        expected_output=case["expected_output"],
        metadata={"case_number": i + 1}
    )

print(f"βœ… Created dataset with {len(test_cases)} test cases")
print(f"Dataset ID: {dataset.id}")
What we did:
  • Created a dataset named email-responses-v1
  • Added 3 test cases covering different scenarios (damaged product, password help, frustrated customer)
  • Each case has input data and expected qualities we want to verify
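If you want to confirm the items landed as expected before moving on, here is a quick sanity check. It is a sketch that uses only the dataset.items, item.metadata, and item.input attributes already shown in this tutorial:
# Sketch: list what was just created
for item in dataset.items:
    case_number = item.metadata["case_number"]
    preview = item.input["customer_email"][:40]
    print(f"Case {case_number}: {preview}...")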

Step 2: Run Your First Evaluation

Now let’s run our AI model on each test case and evaluate the results.
# Your AI function that generates email responses
def generate_email_response(customer_email: str, context: str) -> str:
    """
    This is your actual AI function. For this tutorial, we'll use a simple mock.
    In production, this would call your LLM (OpenAI, Anthropic, etc.)
    """
    # Mock response for demonstration
    # Replace this with your actual LLM call
    return f"Dear valued customer, thank you for contacting us about: {customer_email[:50]}..."

# Run the dataset through your AI
dataset_run = dataset.run(
    name="Initial evaluation",
    description="Testing baseline email response quality",
    metadata={"model": "gpt-4", "temperature": 0.7}
)

# Process each test case
for item in dataset.items:
    # Generate AI response
    input_data = item.input
    ai_response = generate_email_response(
        customer_email=input_data["customer_email"],
        context=input_data["context"]
    )

    # Record the output
    dataset_run.create_observation(
        dataset_item_id=item.id,
        output=ai_response,
        metadata={"input_case": item.metadata["case_number"]}
    )

print(f"βœ… Completed evaluation run: {dataset_run.id}")
print("View results in the ABV platform!")
What we did:
  • Created a dataset run to track this specific evaluation
  • Generated AI responses for each test case
  • Recorded outputs so we can score them
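If you want to swap the mock for a real model, here is a minimal sketch using the OpenAI Python SDK. This is an assumption, not part of the ABV SDK; any LLM client works, and the model name is illustrative:
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def generate_email_response(customer_email: str, context: str) -> str:
    """Generate a customer service reply with a real LLM call (illustrative)."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system", "content": "You are a polite, thorough customer service agent."},
            {"role": "user", "content": f"Customer email: {customer_email}\nContext: {context}"},
        ],
    )
    return completion.choices[0].message.content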

Step 3: Score the Results

Now we’ll add scores to evaluate quality. We’ll use both automated and manual scoring.

Option A: LLM-as-a-Judge (Automated)

Use another LLM to evaluate the responses:
# Score politeness using LLM-as-a-Judge
def score_politeness_llm(response: str) -> dict:
    """
    Use an LLM to judge if the response is polite
    """
    # In production, call your LLM API
    # For demo, we'll return a mock score
    score_value = 0.85  # 0-1 scale

    return {
        "value": score_value,
        "comment": "Response uses professional language and courteous tone"
    }

# Apply scores to each observation
for observation in dataset_run.observations:
    # Get politeness score
    politeness = score_politeness_llm(observation.output)

    # Record the score
    observation.score(
        name="politeness",
        value=politeness["value"],
        comment=politeness["comment"]
    )

print("βœ… Scored all responses for politeness")

Option B: Custom Scoring (Rule-Based)

Create simple rules to check specific criteria:
def score_completeness(response: str, expected: dict) -> float:
    """
    Check whether the response addresses the expected points.
    Simple keyword matching for demonstration; expected criteria without
    a matching rule below count as unmet, which lowers the score.
    """
    score = 0
    checks = 0

    # Check each criterion that is expected to be present
    for key, should_have in expected.items():
        if not should_have:
            continue  # skip criteria that aren't expected in this case
        checks += 1
        if key == "addresses_refund" and "refund" in response.lower():
            score += 1
        elif key == "mentions_order_number" and "#" in response:
            score += 1
        elif key == "provides_steps" and any(word in response.lower() for word in ["step", "first", "then"]):
            score += 1

    return score / checks if checks > 0 else 0

# Apply custom scores
for observation in dataset_run.observations:
    # Get the expected output for this case
    item = dataset.get_item(observation.dataset_item_id)
    expected = item.expected_output

    # Calculate completeness score
    completeness = score_completeness(observation.output, expected)

    # Record the score
    observation.score(
        name="completeness",
        value=completeness,
        comment=f"Addressed {int(completeness * 100)}% of expected points"
    )

print("βœ… Scored all responses for completeness")

Step 4: View Results in the Platform

Now go to the ABV platform to see your results:
  1. Navigate to Evaluations → Datasets
  2. Find your dataset: email-responses-v1
  3. Click on the run: "Initial evaluation"
  4. Review scores for each test case
You’ll see:
  • Overall score statistics (average, min, max)
  • Individual test case results
  • Score breakdown by dimension (politeness, completeness)
  • Comments explaining each score
What to look for:
  • Which test cases scored lowest? Those need attention.
  • Are scores consistent across dimensions?
  • Do comments explain the reasoning clearly?

Step 5: Iterate and Improve

Based on the results, let’s improve our AI and re-evaluate:
# Improved version with better prompting
def generate_email_response_v2(customer_email: str, context: str) -> str:
    """
    Version 2: Improved prompt engineering
    """
    # In production, use a better prompt like:
    prompt = f"""Generate a professional customer service email response.

Customer email: {customer_email}
Context: {context}

Requirements:
- Be polite and professional
- Address all customer concerns
- Offer specific next steps
- Show empathy when appropriate

Response:"""

    # Mock improved response; in production, send the prompt above to your LLM and return its reply
    return "Dear Customer, I sincerely apologize for the issue with your order..."

# Run evaluation again with v2
dataset_run_v2 = dataset.run(
    name="Improved model (v2)",
    description="Testing with better prompting",
    metadata={"model": "gpt-4", "temperature": 0.7, "version": "v2"}
)

for item in dataset.items:
    input_data = item.input
    ai_response = generate_email_response_v2(
        customer_email=input_data["customer_email"],
        context=input_data["context"]
    )

    dataset_run_v2.create_observation(
        dataset_item_id=item.id,
        output=ai_response
    )

# Score the new version
# ... (same scoring code as before)

print("βœ… Completed v2 evaluation - compare results in the platform!")
Now you can compare the two runs side-by-side in the platform to see improvements!
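If you also want a quick local comparison before opening the platform, you can re-score both generators against the same test cases in plain Python. This sketch assumes the test_cases list and the functions defined in the earlier steps are still in scope:
from statistics import mean

# Re-run the rule-based scorer over both mock generators, locally
for label, generate in [("v1", generate_email_response), ("v2", generate_email_response_v2)]:
    scores = [
        score_completeness(
            generate(case["input"]["customer_email"], case["input"]["context"]),
            case["expected_output"],
        )
        for case in test_cases
    ]
    print(f"{label}: mean completeness = {mean(scores):.2f}")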

What You’ve Learned

Congratulations! You’ve completed a full evaluation workflow:
✅ Created a dataset with realistic test cases
✅ Ran evaluations on your AI outputs
✅ Applied scores using both LLM-as-a-Judge and custom rules
✅ Viewed results in the ABV platform
✅ Iterated by running a second evaluation with improvements

Expand Your Dataset

  • Add 10-20 more test cases covering edge cases
  • Include examples from production errors
  • Balance positive and negative examples
Read: Datasets Best Practices

Also explore: Advanced Scoring, Automation, and Integration.

Common Questions

How many test cases do I need?
Start with 10-20 high-quality test cases covering key scenarios. Add more as you find gaps. Quality > quantity; focus on representative, real-world examples.
Should I use custom scoring or LLM-as-a-Judge?
Use both! Custom scoring is fast and deterministic for clear criteria (e.g., "contains word X"). LLM-as-a-Judge is better for nuanced qualities like tone or helpfulness.
How often should I run evaluations?
  • During development: Every time you change prompts or models
  • In production: Daily or weekly automated runs (a minimal score-gate sketch follows this list)
  • For experiments: Before and after each change to measure impact
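For automated runs, a common pattern is to gate a CI pipeline on a minimum average score. A minimal sketch in plain Python; the threshold and how you collect run_scores are up to you:
import sys
from statistics import mean

run_scores = [0.85, 0.72, 0.91]  # placeholder: collect these from your scoring loop
THRESHOLD = 0.75  # illustrative cutoff

if mean(run_scores) < THRESHOLD:
    print(f"Mean score {mean(run_scores):.2f} is below the {THRESHOLD} threshold")
    sys.exit(1)
print("Evaluation gate passed")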
Can I evaluate production traffic without creating a dataset?
Yes! You can score production traces directly without creating datasets. This is called "online evaluation." See Evaluations Overview for details.

Troubleshooting

Dataset not showing up?
  • Wait a few seconds for the data to sync
  • Check your API key permissions
  • Verify you’re looking in the correct project
Scores not calculating?
  • Ensure observations are created before scoring
  • Check that score values are between 0 and 1 (a small clamp helper is sketched after this list)
  • Verify your scoring function returns the correct format
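If out-of-range values are the culprit, a tiny helper (plain Python, nothing SDK-specific) can normalize scores before you record them:
def as_unit_interval(value: float) -> float:
    """Clamp a raw score into the 0-1 range described above."""
    return max(0.0, min(1.0, float(value)))

# Usage: observation.score(name="politeness", value=as_unit_interval(raw_value), ...)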
Can’t compare runs?
  • Both runs must be on the same dataset
  • Scores must have the same names across runs
  • Allow time for score aggregation to complete
For more help, see Evaluations Troubleshooting FAQ

Complete Example Code

Want the full working example? Here’s everything in one place:
from abvdev import ABV

abv = ABV()

# 1. Create dataset
dataset = abv.datasets.create(
    name="email-responses-tutorial",
    description="Quickstart tutorial dataset"
)

# 2. Add test cases
test_cases = [
    {
        "input": {"customer_email": "Order damaged, need refund", "context": "Product: Coffee Maker"},
        "expected_output": {"polite": True, "addresses_refund": True}
    },
    {
        "input": {"customer_email": "Password reset help", "context": "Premium customer"},
        "expected_output": {"polite": True, "provides_steps": True}
    }
]

for case in test_cases:
    dataset.create_item(input=case["input"], expected_output=case["expected_output"])

# 3. Run evaluation
dataset_run = dataset.run(name="Tutorial run")
for item in dataset.items:
    response = f"Mock response to: {item.input['customer_email']}"
    dataset_run.create_observation(dataset_item_id=item.id, output=response)

# 4. Add scores
for observation in dataset_run.observations:
    observation.score(name="quality", value=0.8, comment="Good response")

print(f"βœ… Complete! View results for dataset run: {dataset_run.id}")

You’re now ready to build comprehensive evaluation systems for your LLM applications! 🎉