What are Online Evaluations and How to Set Them Up for Your AI System Using Maxim AI
Introduction
Building an LLM-powered application is one thing; ensuring it performs optimally in production is another challenge entirely. While evaluation during development is necessary, the non-deterministic nature of LLMs means that all possible scenarios can never be fully covered in testing. This is where online evaluations become crucial.
Online vs. Offline Evaluations
Offline evaluations use curated datasets, scenario simulations, and evaluators to benchmark prompts, workflows, and agents before deployment. They involve running prompt tests or agent tests with predefined datasets and evaluators, comparing versions, and generating reports for regression analysis. Online evaluations, on the other hand, attach evaluators to traces, spans, generations, and retrievals to automatically score real production interactions in real time. While offline evaluations focus on pre-release testing and experimentation with controlled data, online evaluations provide continuous monitoring and quality checks on live user interactions, helping identify issues that may not have been covered during testing.
Online evaluations allow you to continuously assess your AI application's performance on real production traffic, helping you identify issues early and make necessary adjustments to improve overall performance. In this guide, we'll explore what online evaluations are and how to implement them using Maxim AI's comprehensive evaluation platform.
Why Evaluate Production Logs?
Evaluating production logs helps cover cases and scenarios that test runs might miss, ensuring the LLM performs well under varied real-world conditions. It also surfaces potential issues early, so you can make the adjustments needed to improve the LLM's overall performance in time.
With Maxim's multi-level evaluation system, you can evaluate at different granularities - from entire conversations (sessions) to individual responses (traces) to specific components (spans) - giving you comprehensive visibility into your AI application's performance.

Understanding Evaluation Levels
Maxim supports evaluating your AI application at three different levels of granularity. This multi-level approach allows you to assess quality at different scopes depending on your use case:
Session-Level Evaluation
Sessions represent multi-turn interactions or conversations. Session-level evaluators assess the quality of an entire conversation flow.
Use session-level evaluation when:
- You want to measure conversation quality across multiple turns
- You need to evaluate multi-turn coherence, context retention, or conversation flow
- You're assessing overall user satisfaction or goal completion
- Your evaluator needs access to the full conversation history
Trace-Level Evaluation
Traces represent single interactions or responses. Trace-level evaluators assess individual completions or responses.
Use trace-level evaluation when:
- You want to measure the quality of individual responses
- You need to evaluate single-turn metrics like helpfulness or accuracy
- You're assessing response-specific attributes like tone or formatting
Span-Level Evaluation
Spans represent specific components within a trace, such as a generation, retrieval, tool call, or custom component. Span-level evaluators assess individual components in isolation.
Use span-level evaluation when:
- You want to evaluate specific components of your agentic workflow
- You need to assess retrieval quality, individual generation steps, or tool usage
- You're optimizing specific parts of your application independently
- You need component-specific metrics for debugging or optimization
As your AI application grows in complexity, it becomes increasingly difficult to understand how it performs across different flows and components. Granular insight is necessary to identify bottlenecks or low-quality areas in your application's or agent's flow. By targeting the underperforming areas, you can optimize overall performance far more effectively than with brute-force approaches.
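To make these levels concrete, here is a minimal sketch of where evaluators attach in code. It follows the decorator and evaluate() pattern from the SDK examples later in this guide; the evaluator names, trace name, and generation_config are illustrative assumptions, and the trace-level call simply mirrors the generation-level pattern shown in those examples (session-level evaluators are configured via the UI, as described below).

# Illustrative sketch only: trace-level vs. span-level evaluator attachment.
# "helpfulness", "clarity", the trace name, and generation_config are placeholders;
# see the full SDK examples later in this guide for complete setups.

@logger.trace(name="support_turn")                    # a trace wraps one interaction
def answer(user_message: str):
    trace = logger.current_trace()
    trace.evaluate().with_evaluators("helpfulness")   # trace-level: score the whole response

    generation = trace.generation(generation_config)  # a span: one component inside the trace
    generation.evaluate().with_evaluators("clarity")  # span-level: score just this step
    ...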
Setting Up Online Evaluations
Prerequisites
Before you can evaluate your production logs, you need to have your logging set up to capture interactions between your LLM and users. This requires integrating the Maxim SDK into your application.
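If you haven't set up logging yet, a minimal sketch looks like the following. It reuses the logger initialization and trace decorator from the SDK examples later in this guide; the API key and trace name are placeholders.

import maxim

# Initialize the Maxim logger (placeholder API key; initialization follows the examples below)
logger = maxim.Logger(api_key="your-api-key")

@logger.trace(name="chat_turn")  # each call is captured as a trace in your log repository
def handle_message(user_message: str) -> str:
    # ... call your LLM here and log the generation via logger.current_trace() ...
    return "model response"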
Method 1: Auto Evaluation via UI
Auto evaluation allows you to automatically run evaluators on your production logs without writing any additional code.
Step 1: Navigate to Your Repository
Access the repository where you want to configure evaluations.
Step 2: Access Evaluation Configuration
Click the "Configure Evaluation" button in the top right corner of the page. This will open up the evaluation configuration sheet.

Step 3: Configure Evaluators at Each Level
The Auto Evaluation section allows you to configure evaluators at different levels:
- Session: Evaluate multi-turn interactions (conversations) as a whole. Use this when you need to assess the quality of an entire conversation or dialogue flow.
- Trace: Evaluate a single response to a user. Use this for evaluating individual interactions or single completions.
- Span: Evaluate specific components within a trace (configured via the SDK).
Step 4: Add Evaluators
For each level, click "Add evaluators" to select the evaluators you want to run.
Once you select an evaluator, you'll need to map variables to the evaluator's required inputs. For example:
- {{input}} might map to the user's input
- {{output}} might map to trace[*].output for session-level or trace.output for trace-level
- {{context}} might map to retrieved context like retrieval[*].retrievedChunks[*]

Variable mapping guidelines:
- Use trace.output to reference a trace's output
- Use trace[*].output to reference all outputs in a session
- Use retrieval[*].retrievedChunks[*] to reference retrieved context from retrieval spans
- Custom mappings can be created by clicking on the mapping field
Step 5: Configure Filters and Sampling
Click "Add filter rule" to create conditions based on various log properties:
- Trace ID / Session ID: Filter by specific trace or session identifiers
- Input / Output: Filter based on user input or model output content
- Error: Filter logs that have errors or specific error types
- Model: Filter by the LLM model used (e.g., gpt-4, claude-3, etc.)
- Tags: Filter by custom tags you've added to your traces
- Metrics: Filter based on evaluation scores or other metrics
- Cost: Filter by cost thresholds (e.g., only evaluate expensive requests)
- Tokens: Filter by token usage (e.g., evaluate long conversations)
- User Feedback: Filter by user ratings or feedback scores
- Latency: Filter by response time (e.g., evaluate slow requests)
Step 6: Save Configuration
Click "Save configurations" to activate your auto evaluation setup.
Method 2: Node-Level Evaluation via SDK
Node-level evaluation enables you to evaluate a trace or one of its components (a span, generation, or retrieval) in isolation. This is done via the Maxim SDK's logger using a simple API.
How the Maxim SDK Logger Evaluates
Evaluating a node requires two main actions:
- Attach Evaluators: Defines which evaluators to run on the particular node. This must be called to start an evaluation on any component.
- Attach Variables: Once evaluators are attached to a component, each evaluator waits for the variables it needs. Only after all of its required variables are attached does the evaluator start processing.
Important notes:
- The evaluator will not run until all of the variables it needs are attached to it.
- If all the variables an evaluator needs are not received within 5 minutes, a Missing variables message is displayed (the evaluator is still processed if the variables arrive after 5 minutes).
Attaching Evaluators
Use the with_evaluators method (withEvaluators in JS/TS) to attach evaluators to any component within a trace, or to the trace itself:
component.evaluate.withEvaluators("evaluator");
// example
generation.evaluate.withEvaluators("clarity", "toxicity");
If you list an evaluator that doesn't exist in your workspace but is available in the store, Maxim will auto-install it for you. If the evaluator is not available in the store either, it is ignored.
Providing Variables to Evaluators
Once evaluators are attached to a component, variables can be passed to them via the with_variables method. This method accepts a key-value pair of variable names to their values.
You also need to specify which evaluators you want these variables to be attached to, which can be done by passing the list of evaluator names as the second argument.
component.evaluate.withVariables(
  { variableName: "value" }, // Key-value pair of variables
  ["evaluator"], // List of evaluators
);
// example
retrieval.evaluate.withVariables(
  { output: assistantResponse.choices[0].message.content },
  ["clarity", "toxicity"],
);
You can directly chain the with_variables method after attaching evaluators to any component, allowing you to skip mentioning the evaluator names again:
trace.evaluate
  .withEvaluators("clarity", "toxicity")
  .withVariables({
    input: userInput,
  });
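In the Python SDK, the same chaining presumably uses the snake_case method names from the examples below; a minimal sketch, assuming the Python methods chain the same way as their JS counterparts:

# Assumed Python equivalent of the chained call above (snake_case naming as in the
# examples below; the chaining behavior itself is an assumption).
trace.evaluate() \
    .with_evaluators("clarity", "toxicity") \
    .with_variables({"input": user_input})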
Example: Basic Generation Evaluation
import maxim
from maxim.decorators import trace, generation

# Initialize logger
logger = maxim.Logger(api_key="your-api-key")

@logger.trace(name="customer_support_chat")
def handle_customer_query(user_message: str):
    # Create a generation for the AI response
    generation_config = {
        "id": "support-response-001",
        "provider": "openai",
        "model": "gpt-4",
        "messages": [
            {"role": "system", "content": "You are a helpful customer support agent."},
            {"role": "user", "content": user_message}
        ],
        "model_parameters": {"temperature": 0.7, "max_tokens": 500},
        "name": "customer_support_response"
    }
    generation = logger.current_trace().generation(generation_config)

    # Attach evaluators to the generation
    generation.evaluate().with_evaluators("clarity", "toxicity", "helpfulness")

    # Provide input variable for all evaluators
    generation.evaluate().with_variables(
        {"input": user_message},
        ["clarity", "toxicity", "helpfulness"]
    )

    # Simulate AI response
    ai_response = "Thank you for contacting us. I understand your concern about..."
    generation.result({
        "choices": [{"message": {"content": ai_response, "role": "assistant"}}],
        "usage": {"total_tokens": 150}
    })

    # Provide output variable for evaluation
    generation.evaluate().with_variables(
        {"output": ai_response},
        ["clarity", "toxicity", "helpfulness"]
    )

    return ai_response
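Calling the decorated function logs the trace and its generation, and the attached evaluators run once both the input and output variables have been received. A small illustrative invocation (the query string is arbitrary):

# Illustrative call to the decorated function above
response = handle_customer_query("How do I reset my password for the mobile app?")
print(response)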
Example: RAG System Evaluation
import maxim
from maxim.decorators import trace, generation, retrieval

logger = maxim.Logger(api_key="your-api-key")

@logger.trace(name="rag_question_answering")
def answer_question_with_rag(question: str, knowledge_base: list):
    # Step 1: Retrieve relevant documents
    @retrieval(name="document_retrieval", evaluators=["Ragas Context Relevancy"])
    def retrieve_documents(query: str):
        retrieval = maxim.current_retrieval()
        retrieval.input(query)

        # Attach evaluation variables
        retrieval.evaluate().with_variables(
            {"input": query},
            ["Ragas Context Relevancy"]
        )

        # Simulate document retrieval
        relevant_docs = [
            {"content": "Document 1 content...", "relevance_score": 0.9},
            {"content": "Document 2 content...", "relevance_score": 0.7}
        ]
        retrieval.output(relevant_docs)

        # Provide context for evaluation
        retrieval.evaluate().with_variables(
            {"context": str(relevant_docs)},
            ["Ragas Context Relevancy"]
        )

        return relevant_docs

    # Step 2: Generate answer using retrieved context
    retrieved_docs = retrieve_documents(question)

    generation_config = {
        "id": "rag-answer-generation",
        "provider": "openai",
        "model": "gpt-4",
        "messages": [
            {
                "role": "system",
                "content": "Answer the question using the provided context. Be accurate and helpful."
            },
            {"role": "user", "content": f"Question: {question}\nContext: {retrieved_docs}"}
        ],
        "name": "rag_answer_generation"
    }
    generation = logger.current_trace().generation(generation_config)

    # Attach multiple evaluators
    generation.evaluate().with_evaluators(
        "clarity", "accuracy", "completeness", "relevance"
    )

    # Provide input variables
    generation.evaluate().with_variables(
        {
            "input": question,
            "context": str(retrieved_docs)
        },
        ["clarity", "accuracy", "completeness", "relevance"]
    )

    # Generate answer
    answer = "Based on the provided context, the answer is..."
    generation.result({
        "choices": [{"message": {"content": answer, "role": "assistant"}}]
    })

    # Provide output for evaluation
    generation.evaluate().with_variables(
        {"output": answer},
        ["clarity", "accuracy", "completeness", "relevance"]
    )

    return answer
Advanced: Custom Local Evaluators
You can create custom evaluators by inheriting from BaseEvaluator:
from maxim.evaluators import BaseEvaluator
from maxim.models import (
    LocalEvaluatorResultParameter,
    LocalData,
    LocalEvaluatorReturn,
    PassFailCriteria,
    PassFailCriteriaOnEachEntry,
    PassFailCriteriaForTestrunOverall
)
import re
from typing import Dict

class CustomQualityEvaluator(BaseEvaluator):
    """Custom evaluator for assessing response quality"""

    def evaluate(self, result: LocalEvaluatorResultParameter, data: LocalData) -> Dict[str, LocalEvaluatorReturn]:
        output = result.output
        context = result.context_to_evaluate

        # Calculate quality score based on multiple factors
        clarity_score = self._assess_clarity(output)
        completeness_score = self._assess_completeness(output, context)
        relevance_score = self._assess_relevance(output, context)

        # Weighted average
        overall_score = (clarity_score * 0.4 + completeness_score * 0.3 + relevance_score * 0.3)
        reasoning = f"Quality assessment: Clarity={clarity_score:.2f}, Completeness={completeness_score:.2f}, Relevance={relevance_score:.2f}"

        return {
            "quality_score": LocalEvaluatorReturn(
                score=overall_score,
                reasoning=reasoning
            )
        }

    def _assess_clarity(self, text: str) -> float:
        """Assess text clarity based on readability metrics"""
        sentences = len(re.split(r'[.!?]+', text))
        words = len(text.split())
        avg_sentence_length = words / max(sentences, 1)

        if avg_sentence_length > 30:
            return 0.3
        elif avg_sentence_length > 20:
            return 0.6
        else:
            return 0.9

    def _assess_completeness(self, output: str, context: str) -> float:
        """Assess how completely the output addresses the context"""
        if not context:
            return 0.5
        output_words = set(output.lower().split())
        context_words = set(context.lower().split())
        if len(context_words) == 0:
            return 0.5
        overlap = len(output_words.intersection(context_words))
        return min(overlap / len(context_words), 1.0)

    def _assess_relevance(self, output: str, context: str) -> float:
        """Assess relevance of output to context"""
        if not context:
            return 0.5
        output_lower = output.lower()
        context_lower = context.lower()
        context_terms = context_lower.split()
        relevant_terms = [term for term in context_terms if term in output_lower]
        if len(context_terms) == 0:
            return 0.5
        return len(relevant_terms) / len(context_terms)

# Usage with custom evaluator
custom_evaluator = CustomQualityEvaluator(
    pass_fail_criteria={
        "quality_score": PassFailCriteria(
            on_each_entry_pass_if=PassFailCriteriaOnEachEntry(">=", 0.7),
            for_testrun_overall_pass_if=PassFailCriteriaForTestrunOverall(">=", 0.8, "average")
        )
    }
)

generation.evaluate().with_evaluators("clarity", custom_evaluator)
Setting Up Human Evaluation
While automated evaluators can provide baseline assessments, they may not capture nuanced human judgment, context, and emotional understanding. Human evaluation complements automated evaluation by providing qualitative feedback, detailed comments, and rewritten outputs that help refine your AI applications.
Prerequisites
Before setting up human evaluation:
- Integrate the Maxim SDK into your application
- Create a Human Evaluator in your workspace by navigating to the Evaluators tab from the sidebar
Configuration Steps
Step 1: Navigate to Repository and Configure Evaluation
Click the "Configure evaluation" button in the top right corner of the page to open the evaluation configuration sheet.
Step 2: Select Human Evaluators
In the Human Evaluation section, click "Add evaluator" and choose from:
- Session evaluators: For multi-turn interactions (sessions)
- Trace evaluators: For single responses (traces)
Step 3: Save Configuration
Click "Save configurations" at the bottom of the sheet to save your human evaluation setup.
Annotating Logs
You can annotate logs from two places:
From the Logs Table
When human evaluators are configured, columns for each evaluator appear in the logs table:
- Click on any cell in a human evaluator column
- In the annotation form, provide a rating for that evaluator
- Optionally add comments or provide a rewritten output
- Click Save to submit your annotation
From Trace Details
- Open any trace from the logs table
- Click the Annotate button in the top right corner of the trace details sheet
- In the annotation form, provide ratings for all configured human evaluators at once
- Optionally add comments for each evaluator or provide a rewritten output
- Save your annotations

Using Saved Views for Annotation Queues
You can use saved views to create filtered queues of logs that need annotation, giving raters specific, filter-based views to work from. Raters can:
- Apply filters to narrow down logs that need annotation (e.g., unannotated logs, specific time ranges, or certain criteria)
- Save these filtered views for quick access
- Use saved views to work through annotation queues systematically
Understanding Annotation Scores
- Average scores: When multiple team members annotate the same log, the average score is shown in the table columns
- Individual breakdown: Click on any annotation to see individual scores, comments, and rewritten outputs from each annotator
- Pass/fail: Scores are evaluated against pass/fail criteria defined in the evaluator configuration
- Rewritten outputs: Multiple team members can provide rewritten outputs; all versions are visible in the trace details view
Viewing Evaluation Results
In the Logs Table View
In the logs table view, a trace's evaluation scores are displayed in its row towards the left end. You can also sort the logs by evaluation score by clicking on any evaluator's column header.
In the Trace Details
Click on a trace to open its details sheet, then navigate to the Evaluation tab to see the evaluation results in detail.
Evaluation Summary
The evaluation summary displays the following information (top to bottom, left to right):
- How many evaluators passed out of the total evaluators across the trace
- The total cost of all the evaluators' evaluations
- The total number of tokens used across all evaluators' evaluations
- The total time taken for the evaluation to process
Evaluation Cards by Level
Depending on what levels you configured evaluators for, you'll see separate evaluation cards:
- Session evaluation card: Shows evaluators that ran on the entire session (multi-turn conversation)
- Trace evaluation card: Shows evaluators that ran on the individual trace (single interaction)
- Span evaluation cards: Show evaluators that ran on specific components within the trace (configured via SDK)
Overview Tab
Shows a summary of all evaluators and their results at the current level.
Individual Evaluator Tabs
Each evaluator has its own tab showing:
- Result: Shows whether the evaluator passed or failed
- Score: Shows the score of the evaluator
- Reason (shown where applicable): Displays the reasoning behind the score of the evaluator, if given
- Cost (shown where applicable): Shows the cost of the individual evaluator's evaluation
- Tokens used (shown where applicable): Shows the number of tokens used by the individual evaluator's evaluation
- Model latency (shown where applicable): Shows the time taken by the model to respond back with a result for an evaluator
- Time taken: Shows the time taken by the evaluator to evaluate
- Variables used to evaluate: Shows the values that were substituted for the evaluator's variables during processing
- Logs: Logs generated during the evaluation process, useful for debugging errors or issues that occurred during the evaluation
Tree View on the Left Panel
The tree view shows the hierarchical structure of your trace, with evaluation results displayed at each level.
Dataset Curation from Evaluations
Once you have logs and evaluations in Maxim, you can easily curate datasets by filtering and selecting logs based on different criteria.
Steps to Curate Datasets
- Filter logs with specific evaluation scores (e.g., bias score greater than 0)
- Select all filtered logs using the top-left selector
- Click the "Add to dataset" button that appears
- Choose to add logs to an existing dataset or create a new dataset
- Map the columns and click "Add entries"
This workflow allows you to systematically build datasets from your production logs based on evaluation results, which can be used for further testing, fine-tuning, or analysis.
Setting Up Alerts and Notifications
Maxim allows you to set up alerts and notifications to stay informed about your AI application's performance in real-time.
Notification Channels
Maxim supports multiple notification channels:
- Slack: Receive alerts directly in your Slack workspace
- PagerDuty: Integrate with your incident management workflow
Managing Integrations
You can manage your notification channel integrations in the following ways:
- Edit an integration: Hover over the integration on the Integrations page and click the edit icon
- Delete an integration: Hover over the integration and click the delete icon
- View integration details: Click on the integration to see its configuration and usage
Performance Metrics Alerts
Monitor your application's performance by setting up alerts for latency, token usage, and cost metrics.
Available Performance Metrics
Set up alerts for:
- Latency: Response times for API calls
- Token Usage: Token consumption per request
- Cost: API usage expenses
Creating Performance Alerts
- Navigate to the repository where you want to set up alerts
- Access the alerts tab within the repository
- Configure alert settings:
- Select Log metrics as the type of alert
- Select a metric (Latency, Token Usage, or Cost)
- Choose an operator (greater than, less than)
- Enter the threshold value
- Set minimum occurrence count
- Define evaluation time range
- Select notification channels: Choose where you want to receive alerts
- Save your alert
Performance Alert Examples
Monitor Response Time:
- Metric: Latency
- Operator: Greater than
- Threshold: 2000ms
- Occurrences: 5 times
- Time range: Last 15 minutes
Monitor Token Consumption:
- Metric: Token Usage
- Operator: Greater than
- Threshold: 10000 tokens
- Occurrences: 3 times
- Time range: Last hour
Monitor Daily Costs:
- Metric: Cost
- Operator: Greater than
- Threshold: $100
- Occurrences: 1 time
- Time range: Last 24 hours
Quality Metrics Alerts
Monitor your AI application's quality with alerts for evaluation scores and quality checks.
Available Quality Metrics
Set up alerts for various evaluation scores, such as:
- Bias-check: Monitor potential biases in AI responses
- Toxicity: Check for inappropriate or harmful content
- Clarity: Validate clear and understandable output
- Factual accuracy: Verify generated information accuracy
- Custom evaluators: Monitor your defined evaluation metrics
Creating Quality Alerts
- Navigate to the repository
- Access the alerts tab within the repository
- Select alert type as Evaluation metrics
- Configure alert settings:
- Choose an evaluation metric (e.g., "Bias-check")
- The violation criteria depend on your evaluator's type and configuration
- Specify how many times this should occur
- Set the evaluation time range
- Choose notification channels: Select where you want to receive alerts
- Save your alert
Quality Alert Examples
Bias Check Alert:
- Evaluator: Bias-check
- Condition: Score > 0.7
- Occurrences: 2 times
- Time range: Last hour
Toxicity Alert:
- Evaluator: Toxicity
- Condition: Failed
- Occurrences: 1 time
- Time range: Immediate
Managing Alerts
Manage your alerts in the following ways:
- Edit an alert: Click the options icon (three dots) on the alert card and select "Edit alert"
- Delete an alert: Click the options icon and select "Delete alert"
- Pause/Resume an alert: Click the options icon and select "Pause alert" or "Resume alert"
Conclusion
Online evaluations are essential for maintaining and improving the quality of AI applications in production. Maxim AI provides a comprehensive platform for implementing multi-level evaluations, from high-level session assessments to granular component-level analysis.
By combining automated evaluations with human annotations and real-time alerting, you can ensure your AI system maintains high quality standards while identifying and addressing issues before they impact users. The flexibility to evaluate at different levels - sessions, traces, and spans - allows you to optimize your application systematically and effectively.
Start implementing online evaluations today to gain deeper insights into your AI application's performance and take your production quality monitoring to the next level.