Agent Lifecycle
Understand the full lifecycle of an agent in Recif: from creation through deployment, running, and termination.
Overview
Every agent in Recif progresses through a well-defined lifecycle managed by the Kubernetes operator. The operator continuously reconciles the declared state (Agent CRD) with the actual state (pods, services, configmaps) in the cluster.
Four Phases
The Agent CRD defines four status phases:
Pending --> Running --> Failed (recoverable)
                    \
                     --> Terminated
| Phase | Description | What Triggers It |
|---|---|---|
| Pending | Agent resource created, operator is provisioning | kubectl apply or API POST /agents/{id}/deploy |
| Running | Deployment is healthy, pods are ready, service is reachable | Operator confirms pod readiness |
| Failed | Deployment has errors (crash loop, image pull failure, etc.) | Pod enters CrashLoopBackOff or similar |
| Terminated | Agent has been stopped or deleted | API POST /agents/{id}/stop or DELETE /agents/{id} |
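The phase table above can be read as a small state machine. The sketch below encodes only the transitions this page describes; the names `Phase`, `ALLOWED_TRANSITIONS`, and `can_transition` are illustrative and not part of the Recif API, and the real operator may permit transitions not listed here (for example, deleting an agent that is still Pending).

```python
from enum import Enum


class Phase(str, Enum):
    PENDING = "Pending"
    RUNNING = "Running"
    FAILED = "Failed"
    TERMINATED = "Terminated"


# Only the transitions named in this document (assumption: the operator
# may allow more than these).
ALLOWED_TRANSITIONS = {
    Phase.PENDING: {Phase.RUNNING},
    Phase.RUNNING: {Phase.FAILED, Phase.TERMINATED},
    Phase.FAILED: {Phase.RUNNING},      # recoverable once the cause is fixed
    Phase.TERMINATED: {Phase.RUNNING},  # via POST /agents/{id}/restart
}


def can_transition(current: Phase, target: Phase) -> bool:
    """Return True if the documented lifecycle allows current -> target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```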
What Happens at Each Transition
Created (API) --> Pending (K8s)
- The API creates the agent record in PostgreSQL
- POST /agents/{id}/deploy triggers the operator
- The operator reads the Agent CRD spec and creates:
  - A Deployment with the Corail container image
  - A Service on port 8000 (data plane) and 8001 (control plane)
  - A ConfigMap with agent configuration (model type, strategy, tools, etc.)
  - Env vars from envSecrets (defaults to agent-env)
- Agent CRD status is set to Pending
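As a rough illustration of the resources listed above, the sketch below builds the three manifests as plain Python dicts. The function name `build_resources`, the `app` label, the container name, the port names, and the treatment of envSecrets as a list of Secret names are all assumptions made for the example; only the ports (8000, 8001) and the agent-env default come from this page.

```python
def build_resources(name, namespace, image, config, env_secrets=None):
    """Return (deployment, service, configmap) manifests as plain dicts."""
    secrets = env_secrets or ["agent-env"]  # documented default
    labels = {"app": name}                  # label scheme is an assumption

    def meta():
        return {"name": name, "namespace": namespace, "labels": dict(labels)}

    container = {
        "name": "corail",  # hypothetical container name
        "image": image,
        "ports": [{"containerPort": 8000}, {"containerPort": 8001}],
        # Each Secret is exposed to the container as environment variables.
        "envFrom": [{"secretRef": {"name": s}} for s in secrets],
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": meta(),
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {"containers": [container]},
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": meta(),
        "spec": {
            "selector": dict(labels),
            "ports": [
                {"name": "data", "port": 8000},     # data plane
                {"name": "control", "port": 8001},  # control plane
            ],
        },
    }
    configmap = {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": meta(),
        # ConfigMap values must be strings.
        "data": {key: str(value) for key, value in config.items()},
    }
    return deployment, service, configmap
```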
Pending --> Running
- The Kubernetes scheduler places the pod on a node
- The Corail container starts and loads the model configuration
- The health endpoint (/healthz) begins responding
- The operator detects pod readiness and updates:
  - status.phase to Running
  - status.replicas to the observed count
  - status.endpoint to the in-cluster service URL
- An AgentDeployed event is emitted on the event bus
- The release handler creates a new versioned release in recif-state
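The status update in the steps above might look roughly like the following. This is a sketch under assumptions: `observed_status` is a hypothetical helper, and the endpoint format uses the standard Kubernetes service DNS name with the data-plane port, which this page does not confirm as the exact value Recif writes.

```python
def observed_status(name: str, namespace: str, ready_replicas: int, port: int = 8000) -> dict:
    """Derive the status fields the operator is described as updating."""
    if ready_replicas < 1:
        # No ready pod yet: the agent stays in Pending with no endpoint.
        return {"phase": "Pending", "replicas": ready_replicas, "endpoint": ""}
    return {
        "phase": "Running",
        "replicas": ready_replicas,
        # Assumed format: standard in-cluster service DNS name.
        "endpoint": f"http://{name}.{namespace}.svc.cluster.local:{port}",
    }
```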
Running --> Failed
- The Corail container crashes (bad config, missing API key, model not found)
- Kubernetes restarts the pod (CrashLoopBackOff)
- The operator detects the unhealthy state
- status.phase is updated to Failed
- Events are logged for debugging via GET /agents/{id}/events
Note
Failed agents are not automatically deleted. Fix the configuration and the operator will reconcile back to Running.
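One way to picture the unhealthy-state check is to scan a pod's container statuses for a waiting reason that signals a crash loop or image pull problem. The sketch below works on the JSON shape of a Kubernetes Pod object; the helper name `pod_failure_reason` and the exact set of reasons treated as failures are assumptions, not the operator's actual logic.

```python
# Waiting reasons treated as failures in this sketch (assumed set).
FAILURE_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull"}


def pod_failure_reason(pod: dict):
    """Return the first failing waiting reason in a Pod dict, or None."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    for container_status in statuses:
        waiting = container_status.get("state", {}).get("waiting")
        if waiting and waiting.get("reason") in FAILURE_REASONS:
            return waiting["reason"]
    return None
```

A non-None result would correspond to setting status.phase to Failed; a None result leaves the phase to the readiness check.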
Running --> Terminated
Termination can happen via stop or delete:
Stop (POST /agents/{id}/stop):
- The operator scales the Deployment to 0 replicas
- status.phase is set to Terminated
- The agent record is preserved -- restart with POST /agents/{id}/restart
Delete (DELETE /agents/{id}):
- The API marks the agent as deleted in the database
- The operator deletes the Deployment, Service, and ConfigMap
- A tombstone is written to recif-state (the release history is preserved)
- The agent record is soft-deleted with a deleted_at timestamp
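The difference between stop and delete can be modeled in memory as follows. This is only an illustration of the effects listed above: `AgentRecord`, `stop`, and `delete` are hypothetical names, and the real platform spreads these steps across the API, the database, and the operator.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class AgentRecord:
    agent_id: str
    configured_replicas: int = 1
    live_replicas: int = 1
    phase: str = "Running"
    deleted_at: Optional[datetime] = None
    resources: set = field(
        default_factory=lambda: {"Deployment", "Service", "ConfigMap"}
    )


def stop(agent: AgentRecord) -> None:
    """Scale to zero; the record and Kubernetes resources are kept."""
    agent.live_replicas = 0
    agent.phase = "Terminated"


def delete(agent: AgentRecord, now: Optional[datetime] = None) -> None:
    """Soft-delete the record and remove the Kubernetes resources."""
    agent.deleted_at = now or datetime.now(timezone.utc)
    agent.resources.clear()
    agent.live_replicas = 0
    agent.phase = "Terminated"
```

Keeping `configured_replicas` separate from `live_replicas` is what lets a stopped agent return to its previous scale on restart.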
Operator Reconciliation Loop
The Recif operator watches for changes to Agent CRDs and reconciles the desired state every 30 seconds.
Watch Agent CRD changes
        |
        v
Compare spec vs actual state
        |
        +---> Spec changed? --> Update Deployment/ConfigMap
        |
        +---> Replicas mismatch? --> Scale Deployment
        |
        +---> Pod unhealthy? --> Update status to Failed
        |
        +---> Pod ready? --> Update status to Running
        |
        v
Update AgentStatus (phase, replicas, endpoint, conditions)
The reconciliation handles:
- Image updates -- Rolling update of the Deployment
- Config changes -- Recreate ConfigMap, trigger pod restart
- Replica scaling -- Scale the Deployment up or down (1-10)
- Canary management -- Create/delete canary Deployments and Services
- Secret propagation -- Mount envSecrets and credentialSecret into pods
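The compare-and-act step of the loop can be sketched as a pure function that takes the declared spec and the observed state and returns the actions to apply plus the resulting phase. The function name `plan_reconcile`, the dict keys, and the action labels are assumptions for illustration; canary and secret handling are left out.

```python
def plan_reconcile(spec: dict, actual: dict):
    """Return (actions, phase) for one reconciliation pass."""
    # The documented replica range is 1 to 10.
    if not 1 <= spec["replicas"] <= 10:
        raise ValueError("replicas must be between 1 and 10")

    actions = []
    if spec["image"] != actual["image"]:
        actions.append("rolling-update")       # image update
    if spec["config"] != actual["config"]:
        actions.append("recreate-configmap")   # config change, pods restart
    if spec["replicas"] != actual["replicas"]:
        actions.append("scale")                # replica mismatch

    # An unhealthy pod takes precedence over readiness.
    if actual["unhealthy"]:
        phase = "Failed"
    elif actual["ready"]:
        phase = "Running"
    else:
        phase = "Pending"
    return actions, phase
```

Returning a plan instead of applying changes directly keeps the decision logic easy to test without a cluster.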
Version Tracking (Releases)
Every meaningful change to an agent creates a new release in the recif-state Git repository. The release pipeline is event-driven:
| Event | Trigger | Result |
|---|---|---|
| AgentDeployed | First deploy or restart | Release vN created |
| AgentConfigChanged | Config update via API/dashboard | Release vN+1 created |
| AgentCanaryPromoted | Canary promoted to stable | Release vN+2 created |
| AgentDeleted | Agent deleted | Tombstone written |
Each release is an immutable YAML artifact stored at agents/{slug}/releases/v{N}.yaml. The active release is also written to agents/{slug}/current.yaml.
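To make the event-to-release mapping concrete, the sketch below computes which files in recif-state an event would touch. The paths follow the layout given above; the function names and the assumption that each release event increments the version by exactly one are illustrative.

```python
# Events that produce a new versioned release, per the table above.
RELEASE_EVENTS = {"AgentDeployed", "AgentConfigChanged", "AgentCanaryPromoted"}


def release_path(slug: str, version: int) -> str:
    return f"agents/{slug}/releases/v{version}.yaml"


def current_path(slug: str) -> str:
    return f"agents/{slug}/current.yaml"


def handle_event(event: str, slug: str, latest_version: int):
    """Return (new_latest_version, paths_written) for a lifecycle event."""
    if event in RELEASE_EVENTS:
        version = latest_version + 1
        # A new immutable release plus the refreshed current.yaml pointer.
        return version, [release_path(slug, version), current_path(slug)]
    if event == "AgentDeleted":
        # Only current.yaml is overwritten with a tombstone; history stays.
        return latest_version, [current_path(slug)]
    raise ValueError(f"unknown event: {event}")
```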
Deletion Flow
When an agent is deleted, the platform ensures a clean teardown while preserving audit history:
- Database -- Agent record receives a deleted_at timestamp (soft delete)
- Kubernetes -- Deployment, Service, and ConfigMap are deleted from the team namespace
- Git state -- A tombstone is written to agents/{slug}/current.yaml:
apiVersion: agents.recif.dev/v1
kind: AgentRelease
metadata:
  name: support-agent
status: deleted
deleted_at: "2026-04-03T10:00:00Z"
changelog: "Agent deleted from platform"
- Release history -- All versioned releases (v1.yaml, v2.yaml, ...) remain in Git for audit
Tip
The tombstone pattern ensures that even deleted agents have a complete audit trail. You can inspect the full release history of any agent, including deleted ones, by browsing the recif-state repository.
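A tombstone like the one shown in the deletion flow could be rendered with a few lines of string formatting. This is a sketch, not the platform's writer: `tombstone_yaml` is a hypothetical helper, and placing name under metadata with the remaining fields at the top level is an assumption about the nesting.

```python
from datetime import datetime, timezone


def tombstone_yaml(name: str, deleted_at: datetime) -> str:
    """Render a tombstone document for agents/{slug}/current.yaml."""
    # Normalize to UTC and format as an RFC 3339 timestamp with a Z suffix.
    stamp = deleted_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    lines = [
        "apiVersion: agents.recif.dev/v1",
        "kind: AgentRelease",
        "metadata:",
        f"  name: {name}",
        "status: deleted",
        f'deleted_at: "{stamp}"',
        'changelog: "Agent deleted from platform"',
    ]
    return "\n".join(lines) + "\n"
```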
Restart and Recovery
Stopped agents (Terminated phase) can be restarted:
curl -X POST http://localhost:8080/api/v1/agents/ag_01J.../restart
This scales the Deployment back to the configured replica count. The operator reconciles and transitions the agent back to Running.
Failed agents automatically recover when the underlying issue is fixed (e.g., a missing Secret is created, a misconfigured model type is corrected). The operator picks up the fix on the next reconciliation cycle.
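The restart call can also be issued from a script. The sketch below only builds the request with the standard library; `restart_request` is a hypothetical helper, the base URL mirrors the local example above, and the agent ID must be supplied by the caller. Authentication headers, if the API requires them, are not covered here.

```python
import urllib.request


def restart_request(agent_id: str, base_url: str = "http://localhost:8080/api/v1"):
    """Build (but do not send) the POST request that restarts an agent."""
    url = f"{base_url}/agents/{agent_id}/restart"
    return urllib.request.Request(url, method="POST")
```

Passing the returned object to `urllib.request.urlopen` sends it; separating construction from sending makes the call easy to inspect or test without a running API.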