v0.1.0 · Apache 2.0


Eval-Driven Releases

How Recif uses evaluation quality gates to prevent bad agent versions from reaching production.


Overview

Eval-driven releases are a core Recif concept: no agent version reaches production without passing automated quality gates. This is fundamentally different from traditional CI/CD -- you cannot unit-test an LLM's output the same way you test deterministic code.

The Problem

Traditional software deployment relies on code tests: if all tests pass, the build is green, and you deploy. But LLM-powered agents have non-deterministic outputs. The same input can produce different (and differently good) responses depending on the model, prompt, temperature, and context.

How do you know if a prompt change made your agent better or worse? How do you catch quality regressions before users do? How do you enforce a minimum quality bar across your organization?

The Solution: Eval-Gated Releases

Every release in Recif starts with status pending_eval. The release is not applied to production until an evaluation confirms it meets the quality threshold.

Deployment Flow

Config Change --> Release Created (pending_eval) --> Evaluation --> Approved (active) or Rejected (rollback) --> current.yaml updated

Detailed Flow

1. Configuration Change

A developer changes the agent's system prompt, model, tools, or skills -- via the dashboard, API, or kubectl.

2. Release Created

The release handler creates an immutable release artifact:

apiVersion: agents.recif.dev/v1
kind: AgentRelease
metadata:
  name: support-agent
  version: 5
  previous: 4
  status: pending_eval
  timestamp: "2026-04-03T10:00:00Z"
  changelog: "Updated system prompt for better greeting"
  checksum: "a1b2c3..."

This artifact is committed to agents/support-agent/releases/v5.yaml in the recif-state repository.
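Because the artifact is immutable and committed to git, the checksum lets any consumer verify it has not been altered. The docs do not specify the algorithm, so the following is a plausible sketch only: it assumes SHA-256 over a canonical JSON serialization of the artifact fields (key order sorted for stability).

```python
import hashlib
import json

def release_checksum(artifact: dict) -> str:
    """Hypothetical sketch: the actual Recif checksum algorithm is not
    documented here. sort_keys + tight separators give a stable byte
    representation, so the same artifact always hashes identically."""
    canonical = json.dumps(artifact, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

artifact = {
    "name": "support-agent",
    "version": 5,
    "previous": 4,
    "changelog": "Updated system prompt for better greeting",
}
digest = release_checksum(artifact)
```

The key property is determinism: reordering the input keys must not change the digest, or the audit trail would produce spurious mismatches.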

3. Evaluation Triggered

If governance.min_quality_score > 0, Recif sends an async evaluation request to the Corail agent's control plane:

POST http://{agent-slug}.team-default.svc.cluster.local:8001/control/evaluate

The request includes:

  • The golden dataset (test cases with inputs and expected outputs)
  • The risk profile (determines which scorers to use)
  • The minimum quality score threshold
  • A callback URL for results
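Putting those four items together, the request body might look like the sketch below. The field names are assumptions (the docs only list what the request contains, not its schema); the callback path follows the eval-result endpoint shown in step 5.

```python
def build_eval_request(agent_id: str, version: int, golden_dataset: list,
                       risk_profile: str, min_quality_score: int) -> dict:
    """Sketch of the async evaluation request body. Field names are
    hypothetical; only the four listed components are documented."""
    return {
        "dataset": golden_dataset,               # test cases: inputs + expected outputs
        "risk_profile": risk_profile,            # determines which scorers run
        "min_quality_score": min_quality_score,  # minimum quality bar
        "callback_url": f"/api/v1/agents/{agent_id}/releases/{version}/eval-result",
    }
```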

4. Scoring

The Corail evaluator runs mlflow.genai.evaluate() with the risk-profile scorers. Each test case is sent through the agent pipeline, and the LLM judge scores the response.

5. Callback

Corail POSTs results back to Recif:

POST /api/v1/agents/{id}/releases/{version}/eval-result
{
  "run_id": "ev_01J...",
  "status": "completed",
  "scores": {
    "safety/mean": 0.95,
    "relevance_to_query/mean": 0.88,
    "correctness/mean": 0.82
  },
  "passed": true,
  "verdict": "PASSED (avg=0.883 >= 0.750)"
}
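The `verdict` string suggests how the pass/fail decision is derived: average the per-scorer means and compare against the threshold, with `min_quality_score` (0-100) scaled to the scorers' 0-1 range. This is inferred from the example callback above, not a documented algorithm:

```python
def verdict(scores: dict, min_quality_score: int) -> tuple:
    """Sketch of the pass/fail decision, inferred from the example
    callback: mean of the per-scorer means vs. a 0-1 threshold."""
    avg = sum(scores.values()) / len(scores)
    threshold = min_quality_score / 100
    passed = avg >= threshold
    label, op = ("PASSED", ">=") if passed else ("FAILED", "<")
    return passed, f"{label} (avg={avg:.3f} {op} {threshold:.3f})"

scores = {
    "safety/mean": 0.95,
    "relevance_to_query/mean": 0.88,
    "correctness/mean": 0.82,
}
passed, msg = verdict(scores, 75)
# → (True, "PASSED (avg=0.883 >= 0.750)")
```

Note that a simple mean lets a strong score mask a weak one; per-scorer minimums would be a stricter gate, but the example callback only shows an average.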

6. Approve or Reject

Approved: The release status changes to active. The artifact is written to current.yaml and the K8s CRD is patched with the new configuration.

Rejected: The release status changes to rejected. The previous active version remains in current.yaml. The CRD is rolled back to the previous configuration. The rejection reason is appended to the changelog.
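Step 6 is a small state transition on the release artifact. A minimal sketch, assuming the release is represented as a dict mirroring the artifact metadata (rollback of `current.yaml` and the CRD is handled separately and not shown):

```python
def apply_eval_result(release: dict, passed: bool, reason: str = "") -> dict:
    """Sketch of the approve/reject transition. A release may only leave
    pending_eval via an evaluation result; on rejection the reason is
    appended to the changelog, matching the documented behavior."""
    if release["status"] != "pending_eval":
        raise ValueError("only pending_eval releases can be gated")
    if passed:
        release["status"] = "active"
    else:
        release["status"] = "rejected"
        release["changelog"] += f" [rejected: {reason}]"
    return release
```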

Comparison with Traditional CI/CD

| Aspect | Traditional CI/CD | Recif Eval-Driven |
| --- | --- | --- |
| What is tested | Code logic, unit tests, integration tests | LLM output quality via scorer judges |
| Test type | Deterministic (pass/fail) | Probabilistic (score threshold) |
| When | Before merge (PR checks) | Before deploy (release gate) |
| Feedback | Build logs | MLflow metrics, scorer breakdowns |
| Regression detection | Test failures | Score comparison between versions |
| Continuous monitoring | Uptime/error rate | Production sampling with eval scorers |

Why This Is Unique to Recif

Most LLM platforms treat deployment as a binary: the agent is deployed or it is not. Recif introduces a quality gate between "configured" and "live" that is specifically designed for non-deterministic AI outputs.

Key differentiators:

  • 14 purpose-built scorers covering safety, quality, RAG, and tool use
  • Risk profiles that automatically select the right scorers for your use case
  • Golden datasets that grow from negative user feedback
  • Continuous production scoring via trace sampling
  • Version-level comparisons to detect regressions
  • Git-native audit trail where every release (approved or rejected) is an immutable commit

Auto-Promote (No Eval Gate)

If governance.min_quality_score is 0 (the default), releases are auto-promoted immediately without waiting for evaluation. This is useful during development:

Config Change --> Release Created --> Auto-Approved --> current.yaml updated
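The gate decision itself reduces to one check on the governance setting. A sketch of that branch, assuming the statuses described above:

```python
def initial_status(min_quality_score: int) -> str:
    """Sketch of the release-gate decision: with the default threshold
    of 0 the release is auto-promoted straight to active; any non-zero
    threshold parks it in pending_eval until the callback arrives."""
    return "active" if min_quality_score == 0 else "pending_eval"

initial_status(0)   # auto-promote (development default)
initial_status(75)  # eval-gated
```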

To enable eval-gated releases, set a minimum quality score in the agent's governance configuration:

{
  "governance": {
    "risk_profile": "standard",
    "min_quality_score": 75,
    "eval_dataset": "golden-qa"
  }
}

Warning

With min_quality_score: 0, any configuration change is immediately applied to production. Set a non-zero threshold for customer-facing agents.

Tip

Start with min_quality_score: 60 (LOW risk profile) and increase as you build a comprehensive golden dataset. The quality of your evaluations depends directly on the quality of your test cases.