# Knowledge Bases

Create and manage knowledge bases with the Maree ingestion pipeline for RAG-powered agents.

## Overview

Knowledge bases in Recif are powered by Maree, a pluggable ingestion pipeline that processes documents through a four-stage architecture. Documents are ingested into pgvector for vector search and attached to agents for retrieval-augmented generation (RAG).

## Pipeline Architecture

Maree follows a four-stage pipeline:
```text
Source --> Processor --> Transformer --> Store
```

| Stage | Purpose | Examples |
|---|---|---|
| Source | Fetch raw content | Google Drive, Jira, Confluence, Databricks, S3, local files |
| Processor | Extract structured content | Docling (PDF, DOCX, HTML), CSV parser |
| Transformer | Generate embeddings | Ollama (nomic-embed-text), OpenAI embeddings |
| Store | Persist vectors | pgvector (PostgreSQL with vector extension) |
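The four stages in the table above can be sketched as minimal Python interfaces. This is a hypothetical illustration of the stage contracts, not Maree's actual class names or signatures:

```python
from dataclasses import dataclass, field
from typing import Iterable, Protocol


@dataclass
class Chunk:
    """One unit of content flowing through the pipeline."""
    text: str
    embedding: list[float] = field(default_factory=list)


class Source(Protocol):
    def fetch(self) -> Iterable[bytes]: ...          # raw documents


class Processor(Protocol):
    def process(self, raw: bytes) -> list[Chunk]: ...  # structured chunks


class Transformer(Protocol):
    def embed(self, chunks: list[Chunk]) -> list[Chunk]: ...  # add vectors


class Store(Protocol):
    def persist(self, chunks: list[Chunk]) -> int: ...  # rows written


def run_pipeline(src: Source, proc: Processor, tf: Transformer, store: Store) -> int:
    """Drive every fetched document through all four stages in order."""
    written = 0
    for raw in src.fetch():
        written += store.persist(tf.embed(proc.process(raw)))
    return written
```

Because each stage is an interface, any one of them can be swapped (a new connector, a different embedder) without touching `run_pipeline` itself.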
## Create a Knowledge Base

```bash
curl -X POST http://localhost:8080/api/v1/knowledge-bases \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Product Documentation",
    "description": "All product docs for the support agent",
    "embedding_model": "nomic-embed-text",
    "chunk_size": 512,
    "chunk_overlap": 50
  }'
```

Response:

```json
{
  "data": {
    "id": "kb_01J...",
    "name": "Product Documentation",
    "description": "All product docs for the support agent",
    "document_count": 0,
    "created_at": "2026-04-03T10:00:00Z"
  }
}
```

## Ingest Documents
### Supported Formats

| Format | Extension | Processor |
|---|---|---|
| PDF | .pdf | Docling |
| Microsoft Word | .docx | Docling |
| HTML | .html | Docling |
| CSV | .csv | CSV parser |
### Ingest via API

```bash
curl -X POST http://localhost:8080/api/v1/knowledge-bases/kb_01J.../ingest \
  -H "Content-Type: application/json" \
  -d '{
    "source": {
      "type": "local",
      "path": "/data/docs/product-manual.pdf"
    }
  }'
```

### Ingest from a connector
```bash
# Google Drive
curl -X POST http://localhost:8080/api/v1/knowledge-bases/kb_01J.../ingest \
  -H "Content-Type: application/json" \
  -d '{
    "source": {
      "type": "google-drive",
      "folder_id": "1ABC...",
      "credentials_secret": "gdrive-credentials"
    }
  }'

# S3 bucket
curl -X POST http://localhost:8080/api/v1/knowledge-bases/kb_01J.../ingest \
  -H "Content-Type: application/json" \
  -d '{
    "source": {
      "type": "s3",
      "bucket": "my-docs-bucket",
      "prefix": "knowledge-base/"
    }
  }'

# Confluence
curl -X POST http://localhost:8080/api/v1/knowledge-bases/kb_01J.../ingest \
  -H "Content-Type: application/json" \
  -d '{
    "source": {
      "type": "confluence",
      "space_key": "DOCS",
      "base_url": "https://myorg.atlassian.net"
    }
  }'
```

## Connectors
| Connector | Source Type | Authentication |
|---|---|---|
| Google Drive | google-drive | Service account credentials (Secret) |
| Jira | jira | API token |
| Confluence | confluence | API token |
| Databricks | databricks | Personal access token |
| Amazon S3 | s3 | AWS credentials (IAM role or access key) |
| Local Files | local | Direct filesystem path |
## Docling Extraction
Docling is the default document processor for PDF, DOCX, and HTML files. It handles:
- Multi-page PDFs with layout analysis
- Table extraction with structure preservation
- Image OCR for embedded text
- Header/footer removal
- Section-aware chunking
The processor splits documents into chunks based on the configured `chunk_size` (default: 512 tokens) with `chunk_overlap` (default: 50 tokens) for context continuity.
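The sliding-window behaviour described above can be illustrated with a simplified sketch. It uses a pre-tokenized list rather than Maree's actual tokenizer, and is not the pipeline's real implementation:

```python
def chunk_tokens(
    tokens: list[str],
    chunk_size: int = 512,
    chunk_overlap: int = 50,
) -> list[list[str]]:
    """Split a token list into fixed-size windows overlapping by chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final window already covers the tail
    return chunks
```

The overlap means the last `chunk_overlap` tokens of one chunk reappear at the start of the next, so a sentence split across a chunk boundary still retrieves with its surrounding context.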
## pgvector Storage

Knowledge base vectors are stored in PostgreSQL with the pgvector extension. The default Helm chart deploys `pgvector/pgvector:pg16` with vector similarity search support.
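A nearest-neighbour lookup against a pgvector table uses pgvector's distance operators (`<=>` is cosine distance). The query builder below is a sketch; the table and column names are hypothetical, not Recif's actual schema:

```python
def knn_query(table: str, top_k: int) -> str:
    """Build a pgvector cosine-similarity search for the top_k nearest chunks.

    `<=>` is pgvector's cosine-distance operator; 1 - distance converts it to
    a similarity score. %(query_vec)s is a psycopg-style bind parameter.
    """
    return (
        f"SELECT content, metadata, "
        f"1 - (embedding <=> %(query_vec)s::vector) AS score "
        f"FROM {table} "
        f"ORDER BY embedding <=> %(query_vec)s::vector "
        f"LIMIT {int(top_k)}"
    )
```

Ordering by the raw distance (ascending) and reporting `1 - distance` as the score is what produces results like the `"score": 0.92` seen in search responses.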
### Embedding Model

The default embedding model is Ollama's `nomic-embed-text`, deployed alongside Recif. Configure alternative embedding models through the knowledge base settings.
| Embedding Provider | Model | Dimensions |
|---|---|---|
| Ollama | nomic-embed-text | 768 |
| OpenAI | text-embedding-3-small | 1536 |
| Google Vertex AI | textembedding-gecko | 768 |
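Switching embedding providers changes the vector dimension, which must match the pgvector column the knowledge base was created with. A small sanity check (a sketch, not part of Recif's API) catches mismatches early:

```python
# Dimensions from the table above; model names as published by each provider.
EMBEDDING_DIMS = {
    "nomic-embed-text": 768,
    "text-embedding-3-small": 1536,
    "textembedding-gecko": 768,
}


def check_dims(model: str, vector: list[float]) -> None:
    """Raise if a vector's length does not match the model's published dimension."""
    expected = EMBEDDING_DIMS[model]
    if len(vector) != expected:
        raise ValueError(f"{model}: expected {expected} dims, got {len(vector)}")
```

A 768-dimension column cannot store `text-embedding-3-small` vectors, so changing models generally means re-ingesting the knowledge base.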
## Attach a Knowledge Base to an Agent

### Via CRD

```yaml
apiVersion: recif.dev/v1
kind: Agent
metadata:
  name: support-agent
  namespace: team-default
spec:
  name: "Support Agent"
  framework: adk
  modelType: openai
  modelId: gpt-4
  strategy: agent-react
  knowledgeBases:
    - kb_01J...
    - kb_02K...
```

### Via the Corail environment variable

```bash
CORAIL_KNOWLEDGE_BASES='[{"type":"pgvector","connection_url":"postgres://recif:recif_dev@recif-postgresql:5432/recif","kb_id":"kb_01J..."}]'
```

## Search a Knowledge Base
Query the knowledge base directly to verify ingestion and relevance:
```bash
curl -X POST http://localhost:8080/api/v1/knowledge-bases/kb_01J.../search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How do I reset my password?",
    "top_k": 5
  }'
```

Response:

```json
{
  "data": {
    "results": [
      {
        "content": "To reset your password, navigate to Settings > Security > Change Password...",
        "score": 0.92,
        "metadata": {
          "source": "product-manual.pdf",
          "page": 42,
          "chunk_id": "chunk_001"
        }
      }
    ]
  }
}
```

## List Knowledge Bases

```bash
curl http://localhost:8080/api/v1/knowledge-bases
```

## List Documents in a Knowledge Base

```bash
curl http://localhost:8080/api/v1/knowledge-bases/kb_01J.../documents
```

## Get Knowledge Base Details

```bash
curl http://localhost:8080/api/v1/knowledge-bases/kb_01J...
```

> **Tip**
> When attaching a knowledge base to an agent, the agent automatically uses retrieval-augmented generation. The RAG scorers (`retrieval_relevance`, `retrieval_groundedness`, `retrieval_sufficiency`) become relevant for evaluations; use the HIGH risk profile to include them.
> **Note**
> Maree is a modular system. Each stage (Source, Processor, Transformer, Store) can be replaced with custom implementations. The pluggable architecture supports adding new connectors and processors without changing the pipeline core.
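As a sense of what a custom stage might look like, here is a hypothetical Source that walks a directory tree. The class name and `fetch` method are assumptions for illustration, not Maree's published plugin API:

```python
from pathlib import Path
from typing import Iterable


class DirectorySource:
    """Hypothetical custom Source: yields raw bytes for each matching file under a root."""

    def __init__(self, root: str, suffix: str = ".pdf"):
        self.root = Path(root)
        self.suffix = suffix

    def fetch(self) -> Iterable[bytes]:
        # Deterministic order makes ingestion runs reproducible.
        for path in sorted(self.root.rglob(f"*{self.suffix}")):
            yield path.read_bytes()
```

Downstream stages only see the bytes each file yields, so the rest of the pipeline (Processor, Transformer, Store) runs unchanged regardless of where the documents came from.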