Agent Lifecycle

Understand the full lifecycle of an agent in Recif: from creation through deployment, running, and termination.

Overview

Every agent in Recif progresses through a well-defined lifecycle managed by the Kubernetes operator. The operator continuously reconciles the declared state (Agent CRD) with the actual state (pods, services, configmaps) in the cluster.

Four Phases

The Agent CRD defines four status phases:

Pending --> Running --> Failed (recoverable)
                  \
                   --> Terminated

Phase	Description	What Triggers It
Pending	Agent resource created, operator is provisioning	`kubectl apply` or API `POST /agents/{id}/deploy`
Running	Deployment is healthy, pods are ready, service is reachable	Operator confirms pod readiness
Failed	Deployment has errors (crash loop, image pull failure, etc.)	Pod enters CrashLoopBackOff or similar
Terminated	Agent has been stopped or deleted	API `POST /agents/{id}/stop` or `DELETE /agents/{id}`
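The phases above can be read as a small state machine. The following is a minimal sketch, not the operator's actual code: the `Phase` enum, the `ALLOWED` table, and `can_transition` are hypothetical names, and the table only encodes transitions this page describes explicitly (including recovery from Failed and restart from Terminated).

```python
from enum import Enum


class Phase(str, Enum):
    PENDING = "Pending"
    RUNNING = "Running"
    FAILED = "Failed"
    TERMINATED = "Terminated"


# Transitions described on this page. The Terminated -> Running edge models a
# restart; whether the real operator passes through Pending first is not
# stated, so treat this table as an assumption.
ALLOWED = {
    Phase.PENDING: {Phase.RUNNING},
    Phase.RUNNING: {Phase.FAILED, Phase.TERMINATED},
    Phase.FAILED: {Phase.RUNNING},
    Phase.TERMINATED: {Phase.RUNNING},
}


def can_transition(current: Phase, target: Phase) -> bool:
    """Return True if the page documents a move from current to target."""
    return target in ALLOWED[current]
```

A client could use a table like this to reject nonsensical status updates in tests or dashboards before they reach the API.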

What Happens at Each Transition

Created (API) --> Pending (K8s)

The API creates the agent record in PostgreSQL
POST /agents/{id}/deploy triggers the operator
The operator reads the Agent CRD spec and creates:
- A Deployment with the Corail container image
- A Service on port 8000 (data plane) and 8001 (control plane)
- A ConfigMap with agent configuration (model type, strategy, tools, etc.)
- Env vars from envSecrets (defaults to agent-env)
Agent CRD status is set to Pending
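To make the provisioning step concrete, here is a rough sketch of the three resources as plain Python dictionaries. It is illustrative only: the function name `build_resources`, the `app` label, the `corail` container name, the `-config` suffix, and the single default replica are assumptions, while the ports and the `agent-env` default come from the steps above.

```python
def build_resources(name: str, namespace: str, image: str, config: dict,
                    env_secret: str = "agent-env") -> dict:
    """Build Deployment, Service, and ConfigMap manifests for one agent."""
    labels = {"app": name}  # label key is an assumption
    container = {
        "name": "corail",  # container name is an assumption
        "image": image,
        "ports": [{"containerPort": 8000}, {"containerPort": 8001}],
        # Expose every key of the Secret as an environment variable.
        "envFrom": [{"secretRef": {"name": env_secret}}],
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "namespace": namespace, "labels": labels},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "selector": labels,
            "ports": [
                {"name": "data", "port": 8000, "targetPort": 8000},
                {"name": "control", "port": 8001, "targetPort": 8001},
            ],
        },
    }
    configmap = {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": f"{name}-config", "namespace": namespace},
        # ConfigMap values must be strings.
        "data": {key: str(value) for key, value in config.items()},
    }
    return {"deployment": deployment, "service": service, "configmap": configmap}
```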

Pending --> Running

The Kubernetes scheduler places the pod on a node
The Corail container starts and loads the model configuration
The health endpoint (/healthz) begins responding
The operator detects pod readiness and updates:
- status.phase to Running
- status.replicas to the observed count
- status.endpoint to the in-cluster service URL
An AgentDeployed event is emitted on the event bus
The release handler creates a new versioned release in recif-state
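The status update in this transition might look like the sketch below. The helper name `ready_status` is hypothetical, and the endpoint uses the standard Kubernetes Service DNS form; the exact URL format Recif stores in status.endpoint is an assumption.

```python
def ready_status(name: str, namespace: str, ready_replicas: int,
                 data_port: int = 8000) -> dict:
    """Build the status fields set once the agent's pods are ready."""
    if ready_replicas < 1:
        raise ValueError("at least one ready replica is required for Running")
    return {
        "phase": "Running",
        "replicas": ready_replicas,
        # <service>.<namespace>.svc.cluster.local is the standard in-cluster
        # DNS name for a Service; the port is the data plane port.
        "endpoint": f"http://{name}.{namespace}.svc.cluster.local:{data_port}",
    }
```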

Running --> Failed

The Corail container crashes (bad config, missing API key, model not found)
Kubernetes restarts the pod (CrashLoopBackOff)
The operator detects the unhealthy state
status.phase is updated to Failed
Events are logged for debugging via GET /agents/{id}/events

Note

Failed agents are not automatically deleted. Fix the configuration and the operator will reconcile back to Running.

Running --> Terminated

Termination can happen via stop or delete:

Stop (POST /agents/{id}/stop):

The operator scales the Deployment to 0 replicas
status.phase is set to Terminated
The agent record is preserved -- restart with POST /agents/{id}/restart

Delete (DELETE /agents/{id}):

The API soft-deletes the agent record by setting a deleted_at timestamp
The operator deletes the Deployment, Service, and ConfigMap
A tombstone is written to recif-state (the release history is preserved)
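The lifecycle endpoints mentioned on this page can be wrapped in a small client helper. This is a sketch under assumptions: `lifecycle_request` is a hypothetical name, the base URL is taken from the curl example on this page, and authentication is not covered here. The function only builds the request; nothing is sent until it is passed to `urllib.request.urlopen`.

```python
import urllib.request

# Base URL taken from the curl example on this page; adjust for your install.
BASE_URL = "http://localhost:8080/api/v1"


def lifecycle_request(agent_id: str, action: str) -> urllib.request.Request:
    """Build (but do not send) the HTTP request for a lifecycle action."""
    routes = {
        "deploy": ("POST", f"/agents/{agent_id}/deploy"),
        "stop": ("POST", f"/agents/{agent_id}/stop"),
        "restart": ("POST", f"/agents/{agent_id}/restart"),
        "delete": ("DELETE", f"/agents/{agent_id}"),
        "events": ("GET", f"/agents/{agent_id}/events"),
    }
    if action not in routes:
        raise ValueError(f"unknown lifecycle action: {action}")
    method, path = routes[action]
    return urllib.request.Request(BASE_URL + path, method=method)
```

Stop keeps the agent record and its resources (scaled to zero), while delete removes the Kubernetes resources, so the two map to different HTTP methods and paths.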

Operator Reconciliation Loop

The Recif operator watches for changes to Agent CRDs and reconciles the desired state against the actual cluster state every 30 seconds.

Watch Agent CRD changes
    |
    v
Compare spec vs actual state
    |
    +---> Spec changed? --> Update Deployment/ConfigMap
    |
    +---> Replicas mismatch? --> Scale Deployment
    |
    +---> Pod unhealthy? --> Update status to Failed
    |
    +---> Pod ready? --> Update status to Running
    |
    v
Update AgentStatus (phase, replicas, endpoint, conditions)

The reconciliation handles:

Image updates -- Rolling update of the Deployment
Config changes -- Recreate ConfigMap, trigger pod restart
Replica scaling -- Scale the Deployment up or down (1-10)
Canary management -- Create/delete canary Deployments and Services
Secret propagation -- Mount envSecrets and credentialSecret into pods
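The decision logic in the diagram above can be sketched as two pure functions: one that plans actions from the spec/observed diff, and one that picks the next phase. All names here (`Spec`, `Observed`, `plan_actions`, `next_phase`, the action strings, and the `config_hash` field) are hypothetical; canary and secret handling are left out for brevity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Spec:
    image: str
    replicas: int
    config_hash: str  # assumed fingerprint of the agent configuration


@dataclass
class Observed:
    image: str
    replicas: int
    config_hash: str
    ready_replicas: int
    crash_looping: bool


def plan_actions(spec: Spec, observed: Observed) -> List[str]:
    """Compare desired and actual state and list the actions needed."""
    if not 1 <= spec.replicas <= 10:
        raise ValueError("replicas must be between 1 and 10")
    actions = []
    if spec.image != observed.image:
        actions.append("rolling-update")
    if spec.config_hash != observed.config_hash:
        actions.append("recreate-configmap")
    if spec.replicas != observed.replicas:
        actions.append("scale")
    return actions


def next_phase(current: str, spec: Spec, observed: Observed) -> str:
    """Pick the status phase: unhealthy wins, then ready, else unchanged."""
    if observed.crash_looping:
        return "Failed"
    if observed.ready_replicas >= spec.replicas:
        return "Running"
    return current
```

Keeping the comparison free of side effects makes it easy to unit test; a real controller would apply the planned actions through the Kubernetes API afterwards.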

Version Tracking (Releases)

Every meaningful change to an agent creates a new release in the recif-state Git repository. The release pipeline is event-driven:

Event	Trigger	Result
`AgentDeployed`	First deploy or restart	Release vN created
`AgentConfigChanged`	Config update via API/dashboard	Release vN+1 created
`AgentCanaryPromoted`	Canary promoted to stable	Release vN+2 created
`AgentDeleted`	Agent deleted	Tombstone written

Each release is an immutable YAML artifact stored at agents/{slug}/releases/v{N}.yaml. The active release is also written to agents/{slug}/current.yaml.
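As an illustration of the path layout, the sketch below computes where a release would be written. The helper names are hypothetical, and the assumption that version numbers start at 1 and increase by one per release is inferred from the v1.yaml, v2.yaml naming rather than stated by the platform.

```python
from pathlib import PurePosixPath
from typing import List, Tuple


def next_version(existing_files: List[str]) -> int:
    """Return the next release number given filenames like 'v1.yaml'."""
    numbers = [
        int(name[1:-5])
        for name in existing_files
        if name.startswith("v") and name.endswith(".yaml") and name[1:-5].isdigit()
    ]
    return max(numbers, default=0) + 1


def release_paths(slug: str, version: int) -> Tuple[str, str]:
    """Return (versioned release path, current release path) in recif-state."""
    base = PurePosixPath("agents") / slug
    return (str(base / "releases" / f"v{version}.yaml"),
            str(base / "current.yaml"))
```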

Deletion Flow

When an agent is deleted, the platform ensures a clean teardown while preserving audit history:

Database -- Agent record receives a deleted_at timestamp (soft delete)
Kubernetes -- Deployment, Service, and ConfigMap are deleted from the team namespace
Git state -- A tombstone is written to agents/{slug}/current.yaml:

apiVersion: agents.recif.dev/v1
kind: AgentRelease
metadata:
  name: support-agent
  status: deleted
  deleted_at: "2026-04-03T10:00:00Z"
  changelog: "Agent deleted from platform"

Release history -- All versioned releases (v1.yaml, v2.yaml, ...) remain in Git for audit
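For reference, a tombstone with the same fields as the YAML above could be assembled like this. The function name `build_tombstone` is hypothetical, and serializing the dictionary to YAML is left to whichever library you use.

```python
from datetime import datetime, timezone


def build_tombstone(name: str, deleted_at: datetime) -> dict:
    """Build a tombstone document mirroring the fields shown above."""
    # Normalize to UTC and format as an ISO 8601 timestamp with a Z suffix.
    # Pass a timezone-aware datetime; naive values are treated as local time.
    stamp = deleted_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "apiVersion": "agents.recif.dev/v1",
        "kind": "AgentRelease",
        "metadata": {
            "name": name,
            "status": "deleted",
            "deleted_at": stamp,
            "changelog": "Agent deleted from platform",
        },
    }
```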

Tip

The tombstone pattern ensures that even deleted agents have a complete audit trail. You can inspect the full release history of any agent, including deleted ones, by browsing the recif-state repository.

Restart and Recovery

Stopped agents (Terminated phase) can be restarted:

curl -X POST http://localhost:8080/api/v1/agents/ag_01J.../restart

This scales the Deployment back to the configured replica count. The operator reconciles and transitions the agent back to Running.

Failed agents automatically recover when the underlying issue is fixed (e.g., a missing Secret is created, a misconfigured model type is corrected). The operator picks up the change on the next reconciliation cycle and transitions the agent back to Running.
