This tutorial walks you through a complete evaluation workflow with a practical example. You'll create a simple dataset, run evaluations, and view results, giving you hands-on experience before diving into advanced concepts.
First, let's create a small dataset with 3 test cases. Each case has an input (customer email) and expected qualities.
Python
```python
from abvdev import ABV

abv = ABV()

# Create a dataset for email response evaluation
dataset = abv.datasets.create(
    name="email-responses-v1",
    description="Test cases for customer service email quality"
)

# Add test cases
test_cases = [
    {
        "input": {
            "customer_email": "My order #12345 arrived damaged. I need a refund.",
            "context": "Product: Coffee Maker, Price: $89.99"
        },
        "expected_output": {
            "polite": True,
            "addresses_refund": True,
            "mentions_order_number": True
        }
    },
    {
        "input": {
            "customer_email": "How do I reset my password? I've tried 3 times.",
            "context": "Customer tier: Premium"
        },
        "expected_output": {
            "polite": True,
            "provides_steps": True,
            "offers_additional_help": True
        }
    },
    {
        "input": {
            "customer_email": "Your website is terrible! Nothing works!",
            "context": "Customer tier: Free"
        },
        "expected_output": {
            "polite": True,  # Must stay professional even when customer isn't
            "acknowledges_frustration": True,
            "offers_specific_help": True
        }
    }
]

# Add each test case to the dataset
for i, case in enumerate(test_cases):
    dataset.create_item(
        input=case["input"],
        expected_output=case["expected_output"],
        metadata={"case_number": i + 1}
    )

print(f"✓ Created dataset with {len(test_cases)} test cases")
print(f"Dataset ID: {dataset.id}")
```
TypeScript
```typescript
import { ABV } from "@abvdev/client";

const abv = new ABV();

// Create a dataset for email response evaluation
const dataset = await abv.datasets.create({
  name: "email-responses-v1",
  description: "Test cases for customer service email quality"
});

// Add test cases
const testCases = [
  {
    input: {
      customer_email: "My order #12345 arrived damaged. I need a refund.",
      context: "Product: Coffee Maker, Price: $89.99"
    },
    expected_output: {
      polite: true,
      addresses_refund: true,
      mentions_order_number: true
    }
  },
  {
    input: {
      customer_email: "How do I reset my password? I've tried 3 times.",
      context: "Customer tier: Premium"
    },
    expected_output: {
      polite: true,
      provides_steps: true,
      offers_additional_help: true
    }
  },
  {
    input: {
      customer_email: "Your website is terrible! Nothing works!",
      context: "Customer tier: Free"
    },
    expected_output: {
      polite: true, // Must stay professional even when customer isn't
      acknowledges_frustration: true,
      offers_specific_help: true
    }
  }
];

// Add each test case to the dataset
for (let i = 0; i < testCases.length; i++) {
  await dataset.createItem({
    input: testCases[i].input,
    expected_output: testCases[i].expected_output,
    metadata: { case_number: i + 1 }
  });
}

console.log(`✓ Created dataset with ${testCases.length} test cases`);
console.log(`Dataset ID: ${dataset.id}`);
```
What we did:
Created a dataset named email-responses-v1
Added 3 test cases covering different scenarios (damaged product, password help, frustrated customer)
Each case has input data and expected qualities we want to verify
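Before moving on, you can optionally sanity-check what was uploaded. The sketch below reuses the dataset.items accessor that appears later in this tutorial; the item.expected_output attribute is an assumption based on the field name passed to create_item, so adjust it to however your SDK exposes that data.

```python
# Optional sanity check: list the items we just created.
# `item.expected_output` is an assumption based on the field name passed
# to create_item; adjust if your SDK exposes it differently.
for item in dataset.items:
    print(f"Case {item.metadata['case_number']}:")
    print(f"  Email: {item.input['customer_email']}")
    print(f"  Expected qualities: {list(item.expected_output.keys())}")
```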
Now let's run our AI model on each test case and evaluate the results.
Python
```python
# Your AI function that generates email responses
def generate_email_response(customer_email: str, context: str) -> str:
    """
    This is your actual AI function. For this tutorial, we'll use a simple mock.
    In production, this would call your LLM (OpenAI, Anthropic, etc.)
    """
    # Mock response for demonstration
    # Replace this with your actual LLM call
    return f"Dear valued customer, thank you for contacting us about: {customer_email[:50]}..."

# Run the dataset through your AI
dataset_run = dataset.run(
    name="Initial evaluation",
    description="Testing baseline email response quality",
    metadata={"model": "gpt-4", "temperature": 0.7}
)

# Process each test case
for item in dataset.items:
    # Generate AI response
    input_data = item.input
    ai_response = generate_email_response(
        customer_email=input_data["customer_email"],
        context=input_data["context"]
    )

    # Record the output
    dataset_run.create_observation(
        dataset_item_id=item.id,
        output=ai_response,
        metadata={"input_case": item.metadata["case_number"]}
    )

print(f"✓ Completed evaluation run: {dataset_run.id}")
print("View results in the ABV platform!")
```
TypeScript
```typescript
// Your AI function that generates email responses
function generateEmailResponse(customerEmail: string, context: string): string {
  /**
   * This is your actual AI function. For this tutorial, we'll use a simple mock.
   * In production, this would call your LLM (OpenAI, Anthropic, etc.)
   */
  // Mock response for demonstration
  // Replace this with your actual LLM call
  return `Dear valued customer, thank you for contacting us about: ${customerEmail.slice(0, 50)}...`;
}

// Run the dataset through your AI
const datasetRun = await dataset.run({
  name: "Initial evaluation",
  description: "Testing baseline email response quality",
  metadata: { model: "gpt-4", temperature: 0.7 }
});

// Process each test case
const items = await dataset.items();
for (const item of items) {
  // Generate AI response
  const inputData = item.input;
  const aiResponse = generateEmailResponse(
    inputData.customer_email,
    inputData.context
  );

  // Record the output
  await datasetRun.createObservation({
    dataset_item_id: item.id,
    output: aiResponse,
    metadata: { input_case: item.metadata.case_number }
  });
}

console.log(`✓ Completed evaluation run: ${datasetRun.id}`);
console.log("View results in the ABV platform!");
```
What we did:
Created a dataset run to track this specific evaluation
Generated a (mock) AI response for each test case
Recorded each output as an observation linked to its dataset item
Next, let's score the responses. We'll start with an LLM-as-a-Judge check for politeness.
Python
```python
# Score politeness using LLM-as-a-Judge
def score_politeness_llm(response: str) -> dict:
    """
    Use an LLM to judge if the response is polite
    """
    # In production, call your LLM API
    # For demo, we'll return a mock score
    score_value = 0.85  # 0-1 scale
    return {
        "value": score_value,
        "comment": "Response uses professional language and courteous tone"
    }

# Apply scores to each observation
for observation in dataset_run.observations:
    # Get politeness score
    politeness = score_politeness_llm(observation.output)

    # Record the score
    observation.score(
        name="politeness",
        value=politeness["value"],
        comment=politeness["comment"]
    )

print("✓ Scored all responses for politeness")
```
TypeScript
```typescript
// Score politeness using LLM-as-a-Judge
function scorePolitenessLLM(response: string): { value: number; comment: string } {
  /**
   * Use an LLM to judge if the response is polite
   */
  // In production, call your LLM API
  // For demo, we'll return a mock score
  const scoreValue = 0.85; // 0-1 scale
  return {
    value: scoreValue,
    comment: "Response uses professional language and courteous tone"
  };
}

// Apply scores to each observation
const observations = await datasetRun.observations();
for (const observation of observations) {
  // Get politeness score
  const politeness = scorePolitenessLLM(observation.output);

  // Record the score
  await observation.score({
    name: "politeness",
    value: politeness.value,
    comment: politeness.comment
  });
}

console.log("✓ Scored all responses for politeness");
```
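LLM-as-a-Judge is a good fit for tone, but some of the expected qualities in our dataset (such as mentions_order_number) can be checked with plain rules. Below is a minimal sketch of a custom rule-based scorer in Python. It assumes observations expose the dataset_item_id we passed in earlier and that items expose expected_output; treat both accessors as assumptions and adapt them to your SDK.

```python
# Custom rule-based scoring: deterministic checks for clear-cut criteria.
# Assumptions: observations expose `dataset_item_id` and items expose
# `expected_output`, mirroring the fields written earlier in this tutorial.
import re

def score_mentions_order_number(response: str, customer_email: str) -> float:
    """Return 1.0 if the response repeats an order number found in the email."""
    order_numbers = re.findall(r"#\d+", customer_email)
    if not order_numbers:
        return 1.0  # Nothing to mention; don't penalize
    return 1.0 if any(num in response for num in order_numbers) else 0.0

items_by_id = {item.id: item for item in dataset.items}

for observation in dataset_run.observations:
    item = items_by_id[observation.dataset_item_id]
    expected = item.expected_output

    # Only apply the rule where the test case expects it
    if expected.get("mentions_order_number"):
        value = score_mentions_order_number(
            observation.output,
            item.input["customer_email"]
        )
        observation.score(
            name="mentions_order_number",
            value=value,
            comment="Deterministic check: response repeats the order number"
        )

print("✓ Applied rule-based scores")
```

Deterministic checks like this are cheap enough to run on every evaluation, so they pair well with the slower LLM judge.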
Based on the results, let's improve our AI and re-evaluate:
Python
```python
# Improved version with better prompting
def generate_email_response_v2(customer_email: str, context: str) -> str:
    """
    Version 2: Improved prompt engineering
    """
    # In production, use a better prompt like:
    prompt = f"""Generate a professional customer service email response.

Customer email: {customer_email}
Context: {context}

Requirements:
- Be polite and professional
- Address all customer concerns
- Offer specific next steps
- Show empathy when appropriate

Response:"""

    # Mock improved response
    return "Dear Customer, I sincerely apologize for the issue with your order..."

# Run evaluation again with v2
dataset_run_v2 = dataset.run(
    name="Improved model (v2)",
    description="Testing with better prompting",
    metadata={"model": "gpt-4", "temperature": 0.7, "version": "v2"}
)

for item in dataset.items:
    input_data = item.input
    ai_response = generate_email_response_v2(
        customer_email=input_data["customer_email"],
        context=input_data["context"]
    )

    dataset_run_v2.create_observation(
        dataset_item_id=item.id,
        output=ai_response
    )

# Score the new version
# ... (same scoring code as before)

print("✓ Completed v2 evaluation - compare results in the platform!")
```
TypeScript
```typescript
// Improved version with better prompting
function generateEmailResponseV2(customerEmail: string, context: string): string {
  /**
   * Version 2: Improved prompt engineering
   */
  // In production, use a better prompt like:
  const prompt = `Generate a professional customer service email response.

Customer email: ${customerEmail}
Context: ${context}

Requirements:
- Be polite and professional
- Address all customer concerns
- Offer specific next steps
- Show empathy when appropriate

Response:`;

  // Mock improved response
  return "Dear Customer, I sincerely apologize for the issue with your order...";
}

// Run evaluation again with v2
const datasetRunV2 = await dataset.run({
  name: "Improved model (v2)",
  description: "Testing with better prompting",
  metadata: { model: "gpt-4", temperature: 0.7, version: "v2" }
});

for (const item of await dataset.items()) {
  const inputData = item.input;
  const aiResponse = generateEmailResponseV2(
    inputData.customer_email,
    inputData.context
  );

  await datasetRunV2.createObservation({
    dataset_item_id: item.id,
    output: aiResponse
  });
}

// Score the new version
// ... (same scoring code as before)

console.log("✓ Completed v2 evaluation - compare results in the platform!");
```
Now you can compare the two runs side-by-side in the platform to see improvements!
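The platform UI is the main place to compare runs, but if you want a quick programmatic check, a rough sketch follows. It re-judges each run's outputs locally with the score_politeness_llm helper from the scoring step; with the mock judge both runs will score identically, so expect a meaningful delta only once a real LLM judge is plugged in.

```python
# Rough programmatic comparison: re-judge each run's outputs locally and
# compare the average politeness. The platform UI remains the primary way
# to compare runs; this is just a quick local check.
def average_politeness(run) -> float:
    scores = [score_politeness_llm(obs.output)["value"] for obs in run.observations]
    return sum(scores) / len(scores) if scores else 0.0

baseline = average_politeness(dataset_run)
improved = average_politeness(dataset_run_v2)

print(f"Baseline politeness: {baseline:.2f}")
print(f"V2 politeness:       {improved:.2f}")
print(f"Delta:               {improved - baseline:+.2f}")
```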
Congratulations! You've completed a full evaluation workflow:
✓ Created a dataset with realistic test cases
✓ Ran evaluations on your AI outputs
✓ Applied scores using both LLM-as-a-Judge and custom rules
✓ Viewed results in the ABV platform
✓ Iterated by running a second evaluation with improvements
How many test cases do I need?
Start with 10-20 high-quality test cases covering key scenarios. Add more as you find gaps. Quality > quantity: focus on representative, real-world examples.
Should I use LLM-as-a-Judge or custom scoring?
Use both! Custom scoring is fast and deterministic for clear criteria (e.g., "contains word X"). LLM-as-a-Judge is better for nuanced qualities like tone or helpfulness.
How often should I run evaluations?
During development: Every time you change prompts or models
In production: Daily or weekly automated runs (see the sketch after this list)
For experiments: Before and after each change to measure impact
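For the automated runs mentioned above, one common pattern is to wrap the evaluation in a small script that a scheduler (cron, CI, etc.) invokes. Here is a minimal sketch; the lookup-by-name helper and the imported response function are assumptions, so adapt them to your SDK and application.

```python
# eval_nightly.py: run this from cron or your CI scheduler, for example
#   0 2 * * *  python eval_nightly.py
# The dataset lookup (abv.datasets.get by name) and the imported response
# function are assumptions; swap in your SDK's actual lookup call and your
# application's real response function.
from datetime import date

from abvdev import ABV
from my_app import generate_email_response_v2  # placeholder import (assumption)

def main() -> None:
    abv = ABV()
    dataset = abv.datasets.get(name="email-responses-v1")  # assumed lookup-by-name helper

    run = dataset.run(
        name=f"Nightly eval {date.today().isoformat()}",
        metadata={"trigger": "scheduled"},
    )

    for item in dataset.items:
        response = generate_email_response_v2(
            customer_email=item.input["customer_email"],
            context=item.input["context"],
        )
        run.create_observation(dataset_item_id=item.id, output=response)

    print(f"✓ Nightly run complete: {run.id}")

if __name__ == "__main__":
    main()
```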
Can I evaluate production traces?
Yes! You can score production traces directly without creating datasets. This is called "online evaluation." See Evaluations Overview for details.
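To give a flavor of what online evaluation can look like, here is a purely hypothetical sketch. The trace-related method names are assumptions rather than the documented ABV API, so rely on the Evaluations Overview for the real interface.

```python
# Hypothetical sketch of online evaluation. Every trace-related name below
# (abv.traces.list, trace.output, trace.score) is an assumption, not the
# documented ABV API; see the Evaluations Overview for the real interface.
from abvdev import ABV

abv = ABV()

# Pull recent production traces (hypothetical accessor)
for trace in abv.traces.list(limit=100):
    politeness = score_politeness_llm(trace.output)  # reuse the judge from earlier
    # Attach the score directly to the trace, no dataset or run involved
    trace.score(
        name="politeness",
        value=politeness["value"],
        comment=politeness["comment"],
    )
```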