v0.1.0 · Apache 2.0

Governance

Configure scorecards, guardrail policies, risk profiles, and compliance monitoring for your agents.

Overview

Governance in Recif provides a structured framework to monitor and control agent quality, safety, cost, and compliance. Every agent gets a scorecard with four weighted dimensions, and platform administrators can enforce guardrail policies that restrict agent behavior.

Scorecards

A scorecard is a multi-dimensional quality assessment for an agent. Recif computes scorecards from MLflow evaluation metrics (or deterministic mock data when MLflow is unavailable).

Four Dimensions

| Dimension | Weight | What It Measures |
| --- | --- | --- |
| Quality | 35% | Response correctness, relevance, and factual accuracy |
| Safety | 30% | Harmful content detection, PII exposure, injection attempts |
| Cost | 20% | Token usage, latency, estimated daily cost |
| Compliance | 15% | Policy violations, audit coverage |

The overall score is a weighted average:

overall = quality * 0.35 + safety * 0.30 + cost * 0.20 + compliance * 0.15
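As a quick sanity check, the weighted average can be computed directly. The dimension scores below are illustrative, not real evaluation output:

```python
# Scorecard dimension weights, taken from the table above.
WEIGHTS = {"quality": 0.35, "safety": 0.30, "cost": 0.20, "compliance": 0.15}

def overall_score(scores):
    """Weighted average of the four dimension scores (0-100 scale)."""
    return sum(scores[dim] * weight for dim, weight in WEIGHTS.items())

# Illustrative dimension scores:
print(round(overall_score({"quality": 80, "safety": 90,
                           "cost": 70, "compliance": 75}), 2))  # 80.25
```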

How Each Dimension Is Calculated

Quality (from MLflow metrics):

| Metric | Source | How It Maps |
| --- | --- | --- |
| correctness/mean | MLflow scorer | Averaged with relevance, scaled to 0-100 |
| relevance_to_query/mean | MLflow scorer | Averaged with correctness, scaled to 0-100 |
| source_citation_rate | Computed | Percentage of responses with source citations |
| factual_accuracy | Computed | Percentage of factually correct responses |

Safety:

| Metric | Source | How It Maps |
| --- | --- | --- |
| safety/mean | MLflow scorer | Scaled to 0-100 |
| guard_block_rate | Runtime monitoring | Percentage of blocked requests (lower is better) |
| pii_detection_count | Runtime monitoring | Count of PII leaks (threshold: 5) |
| injection_attempt_count | Runtime monitoring | Count of injection attempts (threshold: 2) |

Cost:

| Metric | Source | How It Maps |
| --- | --- | --- |
| avg_tokens_per_request | Runtime monitoring | Lower is better (threshold: 1500) |
| avg_latency_ms | Runtime monitoring | Lower is better (threshold: 1000ms) |
| estimated_daily_cost_usd | Computed | Lower is better (threshold: $5) |

Compliance:

| Metric | Source | How It Maps |
| --- | --- | --- |
| policy_violation_count | Policy engine | Lower is better (threshold: 2) |
| audit_coverage_pct | Audit system | Higher is better (threshold: 85%) |

Letter Grades

Each dimension and the overall score receive a letter grade:

| Grade | Score Range |
| --- | --- |
| A | >= 90 |
| B | >= 80 |
| C | >= 70 |
| D | >= 60 |
| F | < 60 |
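The cutoffs above reduce to a simple lookup; this sketch mirrors the table:

```python
def letter_grade(score):
    """Map a 0-100 score to a letter grade per the cutoffs above."""
    for grade, cutoff in [("A", 90), ("B", 80), ("C", 70), ("D", 60)]:
        if score >= cutoff:
            return grade
    return "F"

print(letter_grade(82.5))  # B
print(letter_grade(59.9))  # F
```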

Metric Status

Each individual metric has a status based on its threshold:

| Status | Meaning |
| --- | --- |
| ok | Value meets or exceeds threshold |
| warning | Value is below threshold but above critical |
| critical | Value is far below threshold |
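A minimal sketch of this classification for a higher-is-better metric. The critical cutoff (here, 80% of the threshold) is an assumption, since the exact boundary is not documented; lower-is-better metrics would invert the comparisons:

```python
def metric_status(value, threshold, critical_ratio=0.8):
    """Classify a higher-is-better metric against its threshold.

    critical_ratio is an assumed cutoff: values below
    threshold * critical_ratio are treated as critical.
    """
    if value >= threshold:
        return "ok"
    if value >= threshold * critical_ratio:
        return "warning"
    return "critical"

print(metric_status(85.2, 80))  # ok
print(metric_status(75.0, 80))  # warning
print(metric_status(50.0, 80))  # critical
```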

Retrieve a scorecard

# All agents
curl http://localhost:8080/api/v1/governance/scorecards
 
# Single agent
curl http://localhost:8080/api/v1/governance/scorecards/ag_01J...

Response:

{
  "data": {
    "agent_id": "ag_01J...",
    "agent_name": "Support Agent",
    "overall": 82.5,
    "quality": {
      "score": 87.3,
      "grade": "B",
      "metrics": [
        { "name": "source_citation_rate", "value": 85.2, "unit": "percent", "threshold": 80, "status": "ok" },
        { "name": "factual_accuracy", "value": 91.0, "unit": "percent", "threshold": 85, "status": "ok" },
        { "name": "response_relevance", "value": 85.7, "unit": "percent", "threshold": 80, "status": "ok" }
      ]
    },
    "safety": { "score": 90.1, "grade": "A", "metrics": [...] },
    "cost": { "score": 72.4, "grade": "C", "metrics": [...] },
    "compliance": { "score": 78.0, "grade": "C", "metrics": [...] },
    "data_source": "mlflow",
    "updated_at": "2026-04-03T10:00:00Z"
  }
}

Note

When MLflow is connected, scorecards use real evaluation data (data_source: "mlflow"). Without MLflow, deterministic mock data is generated for dashboard preview.

Guardrail Policies

Guardrail policies are rules enforced on agent behavior. Recif ships with 4 default policies.

Default Policies

| Policy | ID | Severity | Rule | Description |
| --- | --- | --- | --- | --- |
| Token Limit | gp_default_tokens | warning | max_tokens < 4096 | Restrict max tokens per request to control cost |
| Latency SLA | gp_default_latency | critical | max_latency < 2000 | Ensure response latency stays under 2 seconds |
| Blocked Topics | gp_default_topics | critical | blocked_topics contains violence, illegal_activity, self_harm | Prevent agents from discussing forbidden topics |
| Daily Cost Cap | gp_default_cost | warning | max_cost_per_day < 10.00 | Alert when daily cost exceeds budget |

Policy Structure

Each policy has the following shape:

{
  "id": "gp_default_tokens",
  "name": "Token Limit",
  "description": "Restrict maximum tokens per request to control cost",
  "severity": "warning",
  "enabled": true,
  "rules": [
    {
      "type": "max_tokens",
      "operator": "lt",
      "value": "4096"
    }
  ]
}
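Evaluating a rule amounts to comparing an observed value against the rule's value with the rule's operator. Only lt appears in the default policies; the other operators in this sketch are assumptions for illustration:

```python
import operator

# "lt" is the only operator shown in the default policies; the rest
# of this mapping is assumed for illustration.
OPS = {"lt": operator.lt, "le": operator.le,
       "gt": operator.gt, "ge": operator.ge}

def rule_passes(rule, observed):
    """True when the observed value satisfies the rule's comparison."""
    return OPS[rule["operator"]](observed, float(rule["value"]))

rule = {"type": "max_tokens", "operator": "lt", "value": "4096"}
print(rule_passes(rule, 1500))  # True  (within the limit)
print(rule_passes(rule, 8000))  # False (violation)
```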

Severity levels:

| Severity | Behavior |
| --- | --- |
| warning | Logs a warning alert, does not block |
| critical | Blocks the action, may trigger rollback |

Create a custom policy

curl -X POST http://localhost:8080/api/v1/governance/policies \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Max Response Length",
    "description": "Limit response length to 2000 tokens for chat agents",
    "severity": "warning",
    "enabled": true,
    "rules": [
      {
        "type": "max_tokens",
        "operator": "lt",
        "value": "2000"
      }
    ]
  }'

Update an existing policy

curl -X PUT http://localhost:8080/api/v1/governance/policies/gp_01J... \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Token Limit (Updated)",
    "description": "Increased limit for summarization agents",
    "severity": "warning",
    "enabled": true,
    "rules": [
      {
        "type": "max_tokens",
        "operator": "lt",
        "value": "8192"
      }
    ]
  }'

Delete a policy

curl -X DELETE http://localhost:8080/api/v1/governance/policies/gp_01J...

List all policies

curl http://localhost:8080/api/v1/governance/policies

Risk Profiles

Risk profiles control the evaluation rigor applied to agent releases. They determine which scorers run and the minimum score threshold.

| Profile | Min Score | Scorers | Use Case |
| --- | --- | --- | --- |
| LOW | 60% | safety, relevance_to_query | Internal tools, prototypes |
| MEDIUM | 75% | safety, relevance_to_query, correctness | Standard agents |
| HIGH | 90% | safety, relevance_to_query, correctness, guidelines, retrieval_groundedness, tool_call_correctness | Customer-facing, regulated |
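The profiles above can be expressed as a small release gate. The structure below is a sketch of the documented table, not Recif's internal representation:

```python
# Risk-profile definitions, from the table above.
RISK_PROFILES = {
    "LOW":    {"min_score": 60, "scorers": ["safety", "relevance_to_query"]},
    "MEDIUM": {"min_score": 75, "scorers": ["safety", "relevance_to_query",
                                            "correctness"]},
    "HIGH":   {"min_score": 90, "scorers": ["safety", "relevance_to_query",
                                            "correctness", "guidelines",
                                            "retrieval_groundedness",
                                            "tool_call_correctness"]},
}

def release_gate(profile, eval_score):
    """True when an evaluation score clears the profile's minimum bar."""
    return eval_score >= RISK_PROFILES[profile]["min_score"]

print(release_gate("MEDIUM", 78.0))  # True
print(release_gate("HIGH", 78.0))    # False
```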

List risk profiles

curl http://localhost:8080/api/v1/risk-profiles

Response:

{
  "data": [
    { "id": "rp_LOW...", "name": "LOW", "min_score": 60, "description": "Minimum quality bar" },
    { "id": "rp_MED...", "name": "MEDIUM", "min_score": 75, "description": "Standard quality bar" },
    { "id": "rp_HIG...", "name": "HIGH", "min_score": 90, "description": "Strict quality bar" }
  ]
}

Configure risk profile per agent

Set the risk profile in the agent's governance configuration. This is stored in the agent's config JSONB and propagated to the release artifact:

{
  "governance": {
    "risk_profile": "high",
    "min_quality_score": 90,
    "eval_dataset": "golden-qa",
    "guards": ["pii-detection", "secrets-scanner"],
    "policies": ["default", "custom-compliance"]
  }
}
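A hypothetical helper for building this block. governance_config is not a Recif API; the min-score mapping simply mirrors the risk-profile table:

```python
# Assumed mapping from risk profile to minimum quality score,
# mirroring the risk-profile table.
MIN_SCORES = {"low": 60, "medium": 75, "high": 90}

def governance_config(risk_profile, **extra):
    """Build the governance block for an agent's config JSONB (sketch)."""
    profile = risk_profile.lower()
    if profile not in MIN_SCORES:
        raise ValueError(f"unknown risk profile: {risk_profile}")
    return {"governance": {"risk_profile": profile,
                           "min_quality_score": MIN_SCORES[profile],
                           **extra}}

cfg = governance_config("high", eval_dataset="golden-qa",
                        guards=["pii-detection", "secrets-scanner"])
print(cfg["governance"]["min_quality_score"])  # 90
```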

Policy Enforcement Flow

User Request
    |
    v
Guardrail Check (pre-processing)
    |
    +---> Blocked topics? --> REJECT (critical)
    |
    +---> Token limit? --> WARN or REJECT
    |
    v
Agent Processes Request
    |
    v
Guardrail Check (post-processing)
    |
    +---> Latency SLA exceeded? --> Log WARNING
    |
    +---> Cost cap exceeded? --> Log WARNING, alert
    |
    v
Response Returned
    |
    v
Metrics Updated --> Scorecard Recalculated
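The flow above can be sketched in code. Everything here is illustrative (the function names and the limits wired to the default policies), not Recif's actual enforcement engine:

```python
def handle_request(prompt, max_tokens, run_agent):
    """Illustrative guardrail flow mirroring the diagram above."""
    # Pre-processing guardrails
    blocked_topics = {"violence", "illegal_activity", "self_harm"}
    if any(topic in prompt.lower() for topic in blocked_topics):
        return {"status": "rejected", "reason": "blocked_topic"}  # critical

    alerts = []
    if max_tokens >= 4096:                      # Token Limit policy
        alerts.append("token_limit_exceeded")   # warning: log, don't block

    response, latency_ms = run_agent(prompt)

    # Post-processing guardrails: log and alert
    if latency_ms > 2000:                       # Latency SLA policy
        alerts.append("latency_sla_exceeded")
    return {"status": "ok", "response": response, "alerts": alerts}
```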

Warning

Policies with critical severity will block requests in real time. Test policies thoroughly in a staging environment before enabling them on production agents.

Tip

Start with the LOW risk profile and the default policies. Increase rigor as you build confidence in your agent's quality through evaluation data.