v0.1.0 · Apache 2.0


Evaluation

Set up eval-driven releases with MLflow, quality gates, and 14 automated scorers.


Overview

Recif integrates with MLflow GenAI to provide a comprehensive evaluation pipeline for your agents. Every release can be gated by quality scores, and the platform ships with 14 built-in scorers covering response quality, RAG accuracy, and tool-call correctness.

The 14 MLflow Scorers

Recif uses a registry-based scorer resolution pattern. Each scorer is an MLflow GenAI scorer instantiated at evaluation time.

Response Quality Scorers

| Scorer | Category | What It Measures |
|---|---|---|
| safety | Safety | Detects harmful, toxic, or unsafe content in responses |
| relevance_to_query | Quality | How well the response addresses the user's question |
| correctness | Quality | Factual accuracy of the response against the expected output |
| completeness | Quality | Whether the response fully answers all parts of the query |
| fluency | Quality | Grammar, readability, and natural language quality |
| equivalence | Quality | Semantic similarity between the response and the reference answer |
| summarization | Quality | Quality of summarization when the task is to summarize |
| guidelines | Compliance | Adherence to custom guidelines provided in the scorer config |
| expectations_guidelines | Compliance | Whether the response meets specific expected behaviors |

RAG Scorers

| Scorer | Category | What It Measures |
|---|---|---|
| retrieval_relevance | RAG | Whether retrieved documents are relevant to the query |
| retrieval_groundedness | RAG | Whether the response is grounded in the retrieved context |
| retrieval_sufficiency | RAG | Whether the retrieved context is sufficient to answer the query |

Tool Call Scorers

| Scorer | Category | What It Measures |
|---|---|---|
| tool_call_correctness | Tools | Whether the agent called the right tools with correct parameters |
| tool_call_efficiency | Tools | Whether the agent used the minimum necessary tool calls |

Risk Profiles

Risk profiles determine which scorers run automatically during eval-gated releases. Each profile maps to a preset list of scorers.

LOW -- 2 scorers

Minimum quality bar. Use for internal tools and low-stakes agents.

| Scorer | Purpose |
|---|---|
| safety | Ensure no harmful content |
| relevance_to_query | Basic relevance check |

Minimum score threshold: 60%

STANDARD -- 3 scorers

Default for most agents. Adds factual correctness.

| Scorer | Purpose |
|---|---|
| safety | Ensure no harmful content |
| relevance_to_query | Relevance check |
| correctness | Factual accuracy |

Minimum score threshold: 75%

HIGH -- 6 scorers

Strict quality bar for customer-facing and regulated agents.

| Scorer | Purpose |
|---|---|
| safety | Ensure no harmful content |
| relevance_to_query | Relevance check |
| correctness | Factual accuracy |
| guidelines | Custom guideline adherence |
| retrieval_groundedness | RAG grounding (if KB attached) |
| tool_call_correctness | Tool use accuracy (if tools attached) |

Minimum score threshold: 90%
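The profile-to-scorer mapping above can be sketched as a simple lookup. This is illustrative only: the names and thresholds come from the tables above, but Recif's actual registry-based resolution is internal, and the conditional-scorer handling shown here is an assumption.

```python
# Illustrative mapping of risk profiles to scorer presets and thresholds.
# Names and thresholds mirror the tables above; the real Recif registry is internal.
RISK_PROFILES = {
    "LOW":      {"scorers": ["safety", "relevance_to_query"], "min_score": 0.60},
    "STANDARD": {"scorers": ["safety", "relevance_to_query", "correctness"],
                 "min_score": 0.75},
    "HIGH":     {"scorers": ["safety", "relevance_to_query", "correctness",
                             "guidelines", "retrieval_groundedness",
                             "tool_call_correctness"], "min_score": 0.90},
}

def scorers_for(profile: str, has_kb: bool = False, has_tools: bool = False) -> list[str]:
    """Return the scorer list for a profile, dropping the conditional
    scorers when no knowledge base or tools are attached (HIGH only)."""
    skip = set()
    if not has_kb:
        skip.add("retrieval_groundedness")
    if not has_tools:
        skip.add("tool_call_correctness")
    return [s for s in RISK_PROFILES[profile]["scorers"] if s not in skip]
```

For example, a HIGH-profile agent with tools but no knowledge base would run five scorers, skipping retrieval_groundedness.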

Golden Datasets

Golden datasets are curated sets of test cases used for consistent evaluation. Each case has an input and an optional expected output.

Create a dataset via API

curl -X POST http://localhost:8080/api/v1/agents/ag_01J.../datasets \
  -H "Content-Type: application/json" \
  -d '{
    "name": "golden-qa",
    "cases": [
      {
        "input": "What is the capital of France?",
        "expected_output": "Paris"
      },
      {
        "input": "Summarize the theory of relativity in one sentence.",
        "expected_output": "Space and time are interwoven and warped by mass and energy."
      },
      {
        "input": "Translate hello to Spanish.",
        "expected_output": "hola"
      }
    ]
  }'

Dataset structure

Each case supports the following fields:

| Field | Type | Required | Description |
|---|---|---|---|
| input | string | Yes | The user query to send to the agent |
| expected_output | string | No | The reference answer for correctness scoring |
| context | string | No | Additional context for RAG evaluations |
| metadata | object | No | Key-value pairs for filtering and tagging |
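A case using all four fields might look like this (the values are hypothetical, for illustration only):

```python
# A hypothetical golden-dataset case exercising every supported field.
case = {
    "input": "What is our refund window?",                      # required
    "expected_output": "Refunds are accepted within 30 days.",  # reference answer
    "context": "Policy: refunds accepted within 30 days of purchase.",  # RAG context
    "metadata": {"source": "support_faq", "priority": "high"},  # tags for filtering
}
```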

Seed datasets

When you first trigger an evaluation, Recif creates a seed dataset with 5 basic test cases if none exist. Replace these with domain-specific cases for meaningful results.

Triggering Evaluations

Manual trigger via API

curl -X POST http://localhost:8080/api/v1/agents/ag_01J.../evaluations \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_name": "golden-qa",
    "version": "1"
  }'

Response:

{
  "data": {
    "id": "ev_01J...",
    "agent_id": "ag_01J...",
    "agent_version": "1",
    "dataset_name": "golden-qa",
    "status": "completed",
    "provider": "mlflow",
    "aggregate_scores": {
      "safety/mean": 0.95,
      "relevance_to_query/mean": 0.88,
      "correctness/mean": 0.82
    },
    "total_cases": 3,
    "passed_cases": 3,
    "started_at": "2026-04-03T10:00:00Z",
    "completed_at": "2026-04-03T10:00:45Z"
  }
}

On release (eval-gated)

When an agent has governance.min_quality_score > 0 in its configuration, every new release is created with status pending_eval. Recif automatically:

  1. Creates the release in recif-state with status pending_eval
  2. Sends the evaluation request to the Corail agent's /control/evaluate endpoint
  3. Corail runs mlflow.genai.evaluate() with the configured risk profile scorers
  4. Corail POSTs results back to /api/v1/agents/{id}/releases/{version}/eval-result
  5. Recif approves or rejects the release based on scores
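Step 5 amounts to comparing the evaluation scores against the agent's min_quality_score. A minimal sketch, assuming the decision uses the unweighted mean of the per-scorer means (Recif's internal rule may weight scorers differently):

```python
def gate_release(aggregate_scores: dict[str, float], min_quality_score: float) -> str:
    """Approve a pending_eval release only if the average of all
    per-scorer means clears the configured quality threshold.
    Illustrative sketch, not Recif's actual decision code."""
    if not aggregate_scores:
        # No scores at all (e.g. evaluation failed): fail closed.
        return "rejected"
    mean = sum(aggregate_scores.values()) / len(aggregate_scores)
    return "approved" if mean >= min_quality_score else "rejected"
```

With the example response above (means of 0.95, 0.88, and 0.82), the average is about 0.88, so the release would pass a STANDARD gate (0.75) but fail a HIGH gate (0.90).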

Production sampling

Set evalSampleRate on the Agent CRD (0--100) to automatically evaluate a percentage of production traces:

spec:
  evalSampleRate: 10          # Evaluate 10% of production traffic
  judgeModel: "openai:/gpt-4o-mini"

Eval-Gated Releases

The release lifecycle with evaluation gates: a new release enters pending_eval, its risk-profile scorers run against the golden dataset, and it moves to approved or rejected based on the results.

(Diagram: Deployment Flow)

LLM-as-Judge Configuration

The evaluation engine uses an LLM judge to score responses. Configure the judge model per agent:

spec:
  judgeModel: "openai:/gpt-4o-mini"

The judge model is passed to every MLflow scorer as the model parameter. You can use any supported provider:

| Judge Model | Example Value |
|---|---|
| OpenAI | openai:/gpt-4o-mini |
| Anthropic | anthropic:/claude-sonnet-4-20250514 |
| Google AI | google-ai:/gemini-2.5-flash |

Tip

Use a fast, cost-effective model as the judge (like gpt-4o-mini), since it evaluates every test case in the dataset. The agent being evaluated uses its own configured model.

Feedback Loop

Recif implements a continuous improvement loop where negative user feedback flows back into evaluation datasets.

How it works

  1. User gives a thumbs-down (value < 3 on 1-5 scale, or < 0.6 on 0-1 scale)
  2. Recif extracts the original input from the MLflow trace
  3. The input is appended as a new test case in the agent's golden dataset (with metadata.source: "negative_feedback")
  4. Next evaluation run includes the new case
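Step 1's two thresholds can be collapsed into one check by branching on the rating scale. A sketch of that check (assumed, not Recif's exact code):

```python
def is_negative_feedback(value: float, scale: str) -> bool:
    """Decide whether a rating counts as a thumbs-down and should be
    appended to the golden dataset. scale is "1-5" for star ratings
    or "0-1" for normalized scores, matching the thresholds above."""
    if scale == "1-5":
        return value < 3
    if scale == "0-1":
        return value < 0.6
    raise ValueError(f"unknown scale: {scale}")
```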

Submit feedback via API

curl -X POST http://localhost:8080/api/v1/feedback \
  -H "Content-Type: application/json" \
  -d '{
    "trace_id": "tr_abc123",
    "name": "user_rating",
    "value": 1,
    "source": "user",
    "comment": "The answer was completely wrong",
    "agent_id": "ag_01J..."
  }'

Response:

{
  "status": "recorded",
  "trace_id": "tr_abc123",
  "name": "user_rating",
  "value": 1,
  "proxied": true,
  "dataset_appended": true
}

Comparing Evaluation Runs

Compare two evaluation runs side by side to detect regressions:

curl "http://localhost:8080/api/v1/agents/ag_01J.../evaluations/compare?a=ev_run1&b=ev_run2"

Response:

{
  "data": {
    "run_a": "ev_run1",
    "run_b": "ev_run2",
    "metrics": {
      "correctness/mean": { "a": 0.82, "b": 0.91, "diff": 0.09, "winner": "b" },
      "safety/mean": { "a": 0.95, "b": 0.93, "diff": -0.02, "winner": "a" }
    },
    "winner": "b"
  }
}
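The per-metric diff and winner fields follow directly from the two runs' aggregate scores. A sketch of that computation, assuming only metrics present in both runs are compared and the overall winner is the run with more metric wins (the tie-breaking here is illustrative):

```python
def compare_runs(scores_a: dict[str, float], scores_b: dict[str, float]) -> dict:
    """Build a per-metric comparison shaped like the /evaluations/compare
    response. Illustrative sketch of the diff/winner computation."""
    metrics = {}
    wins = {"a": 0, "b": 0}
    for name in scores_a.keys() & scores_b.keys():   # metrics in both runs
        diff = round(scores_b[name] - scores_a[name], 4)
        winner = "b" if diff > 0 else "a"
        wins[winner] += 1
        metrics[name] = {"a": scores_a[name], "b": scores_b[name],
                         "diff": diff, "winner": winner}
    return {"metrics": metrics,
            "winner": "b" if wins["b"] >= wins["a"] else "a"}
```

Feeding in the aggregate scores from the example above reproduces its per-metric winners: b on correctness (+0.09) and a on safety (-0.02).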

Listing Evaluation Runs

curl http://localhost:8080/api/v1/agents/ag_01J.../evaluations

Note

Evaluation results are stored in MLflow when available. If MLflow is unreachable, Recif falls back to in-memory storage for dashboard preview. The provider field indicates the data source: mlflow, corail, or mock.

Warning

Mock scores (provider: mock) are generated when the agent pod is unreachable. They provide a dashboard preview but do not reflect real agent quality. Always verify with real evaluations before promoting to production.