v0.1.0 · Apache 2.0


Knowledge Bases

Create and manage knowledge bases with the Maree ingestion pipeline for RAG-powered agents.


Overview

Knowledge bases in Recif are powered by Maree, a pluggable ingestion pipeline that processes documents through a four-stage architecture. Documents are ingested into pgvector for vector search and attached to agents for retrieval-augmented generation (RAG).

Pipeline Architecture

Maree follows a four-stage pipeline:

Source --> Processor --> Transformer --> Store
| Stage | Purpose | Examples |
| --- | --- | --- |
| Source | Fetch raw content | Google Drive, Jira, Confluence, Databricks, S3, local files |
| Processor | Extract structured content | Docling (PDF, DOCX, HTML), CSV parser |
| Transformer | Generate embeddings | Ollama (nomic-embed-text), OpenAI embeddings |
| Store | Persist vectors | pgvector (PostgreSQL with vector extension) |
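The four-stage flow can be sketched in a few lines of Python. This is an illustrative stand-in, not the Maree API: every class and function name below is hypothetical, and the "embedding" is a toy two-number vector.

```python
# Minimal sketch of Maree's Source -> Processor -> Transformer -> Store flow.
# All names are illustrative, not actual Maree interfaces.

class LocalSource:
    """Source: fetch raw content (an in-memory stand-in for a file)."""
    def __init__(self, raw):
        self.raw = raw
    def fetch(self):
        return self.raw

class TextProcessor:
    """Processor: extract structured content (split into paragraphs)."""
    def process(self, raw):
        return [p.strip() for p in raw.split("\n\n") if p.strip()]

class FakeEmbedder:
    """Transformer: generate embeddings (a toy length-based vector)."""
    def embed(self, chunk):
        return [float(len(chunk)), float(chunk.count(" "))]

class MemoryStore:
    """Store: persist vectors (a list standing in for pgvector)."""
    def __init__(self):
        self.rows = []
    def persist(self, chunk, vector):
        self.rows.append({"content": chunk, "embedding": vector})

def ingest(source, processor, transformer, store):
    for chunk in processor.process(source.fetch()):
        store.persist(chunk, transformer.embed(chunk))
    return store

store = ingest(LocalSource("First paragraph.\n\nSecond paragraph."),
               TextProcessor(), FakeEmbedder(), MemoryStore())
print(len(store.rows))  # 2
```

Because each stage is a separate object with a single method, any one of them can be swapped out (a different Source, a real embedder) without touching the others, which is the point of the pluggable design.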

Create a Knowledge Base

curl -X POST http://localhost:8080/api/v1/knowledge-bases \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Product Documentation",
    "description": "All product docs for the support agent",
    "embedding_model": "nomic-embed-text",
    "chunk_size": 512,
    "chunk_overlap": 50
  }'

Response:

{
  "data": {
    "id": "kb_01J...",
    "name": "Product Documentation",
    "description": "All product docs for the support agent",
    "document_count": 0,
    "created_at": "2026-04-03T10:00:00Z"
  }
}

Ingest Documents

Supported Formats

| Format | Extension | Processor |
| --- | --- | --- |
| PDF | .pdf | Docling |
| Microsoft Word | .docx | Docling |
| HTML | .html | Docling |
| CSV | .csv | CSV parser |
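The format table above maps file extensions to processors; a dispatch based on it might look like the following sketch (the mapping values and function name are illustrative, not Maree identifiers):

```python
# Mirrors the supported-formats table; names are illustrative.
PROCESSOR_FOR_EXTENSION = {
    ".pdf": "docling",
    ".docx": "docling",
    ".html": "docling",
    ".csv": "csv-parser",
}

def pick_processor(path):
    """Choose a processor from the file extension (case-insensitive)."""
    ext = path[path.rfind("."):].lower()
    try:
        return PROCESSOR_FOR_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"unsupported format: {ext}")

print(pick_processor("/data/docs/product-manual.pdf"))  # docling
```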

Ingest via API

curl -X POST http://localhost:8080/api/v1/knowledge-bases/kb_01J.../ingest \
  -H "Content-Type: application/json" \
  -d '{
    "source": {
      "type": "local",
      "path": "/data/docs/product-manual.pdf"
    }
  }'

Ingest from a connector

# Google Drive
curl -X POST http://localhost:8080/api/v1/knowledge-bases/kb_01J.../ingest \
  -H "Content-Type: application/json" \
  -d '{
    "source": {
      "type": "google-drive",
      "folder_id": "1ABC...",
      "credentials_secret": "gdrive-credentials"
    }
  }'
 
# S3 bucket
curl -X POST http://localhost:8080/api/v1/knowledge-bases/kb_01J.../ingest \
  -H "Content-Type: application/json" \
  -d '{
    "source": {
      "type": "s3",
      "bucket": "my-docs-bucket",
      "prefix": "knowledge-base/"
    }
  }'
 
# Confluence
curl -X POST http://localhost:8080/api/v1/knowledge-bases/kb_01J.../ingest \
  -H "Content-Type: application/json" \
  -d '{
    "source": {
      "type": "confluence",
      "space_key": "DOCS",
      "base_url": "https://myorg.atlassian.net"
    }
  }'

Connectors

| Connector | Source Type | Authentication |
| --- | --- | --- |
| Google Drive | google-drive | Service account credentials (Secret) |
| Jira | jira | API token |
| Confluence | confluence | API token |
| Databricks | databricks | Personal access token |
| Amazon S3 | s3 | AWS credentials (IAM role or access key) |
| Local Files | local | Direct filesystem path |

Docling Extraction

Docling is the default document processor for PDF, DOCX, and HTML files. It handles:

  • Multi-page PDFs with layout analysis
  • Table extraction with structure preservation
  • Image OCR for embedded text
  • Header/footer removal
  • Section-aware chunking

The processor splits documents into chunks based on the configured chunk_size (default: 512 tokens) with chunk_overlap (default: 50 tokens) for context continuity.
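The chunking behaviour can be approximated with a short sketch. Note the simplification: "tokens" here are whatever list you pass in, whereas the real processor counts tokenizer tokens and chunks section-aware.

```python
def chunk_tokens(tokens, chunk_size=512, chunk_overlap=50):
    """Split a token list into overlapping chunks: each new chunk
    starts chunk_size - chunk_overlap tokens after the previous one,
    so consecutive chunks share chunk_overlap tokens of context."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
print(len(chunks))    # 3
print(chunks[1][0])   # tok462 -- overlaps the previous chunk by 50 tokens
```

With the defaults, a 1000-token document yields chunks covering tokens 0-511, 462-973, and 924-999: the 50-token overlap means a sentence split at a chunk boundary still appears whole in one of the two chunks.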

pgvector Storage

Knowledge base vectors are stored in PostgreSQL with the pgvector extension. The default Helm chart deploys pgvector/pgvector:pg16 with vector similarity search support.
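Similarity search ranks stored rows by vector distance; pgvector's cosine-distance operator (`<=>`) computes 1 minus cosine similarity. The ranking can be illustrated with a pure-Python stand-in, no database involved (the row names and vectors are made up):

```python
import math

def cosine_distance(a, b):
    """Same quantity as pgvector's <=> operator: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

query = [1.0, 0.0]
rows = {"doc_a": [1.0, 0.1], "doc_b": [0.0, 1.0], "doc_c": [0.9, 0.9]}

# Smallest distance first = most similar first.
ranked = sorted(rows, key=lambda k: cosine_distance(query, rows[k]))
print(ranked[0])  # doc_a
```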

Embedding model

The default embedding model is Ollama's nomic-embed-text, deployed alongside Recif. Configure alternative embedding models through the knowledge base settings.

| Embedding Provider | Model | Dimensions |
| --- | --- | --- |
| Ollama | nomic-embed-text | 768 |
| OpenAI | text-embedding-3-small | 1536 |
| Google | textembedding-gecko | 768 |

Attach a Knowledge Base to an Agent

Via CRD

apiVersion: recif.dev/v1
kind: Agent
metadata:
  name: support-agent
  namespace: team-default
spec:
  name: "Support Agent"
  framework: adk
  modelType: openai
  modelId: gpt-4
  strategy: agent-react
  knowledgeBases:
    - kb_01J...
    - kb_02K...

Via the Corail environment variable

CORAIL_KNOWLEDGE_BASES='[{"type":"pgvector","connection_url":"postgres://recif:recif_dev@recif-postgresql:5432/recif","kb_id":"kb_01J..."}]'
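The variable holds a JSON array with one entry per attached knowledge base, so it parses like any JSON value. A quick sketch (the example value below uses a placeholder password rather than the dev credentials):

```python
import json
import os

# Example value; in a real deployment this is set on the Corail container.
os.environ["CORAIL_KNOWLEDGE_BASES"] = json.dumps([{
    "type": "pgvector",
    "connection_url": "postgres://recif:<password>@recif-postgresql:5432/recif",
    "kb_id": "kb_01J...",
}])

kbs = json.loads(os.environ["CORAIL_KNOWLEDGE_BASES"])
for kb in kbs:
    print(kb["type"], kb["kb_id"])  # pgvector kb_01J...
```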

Search a Knowledge Base

Query the knowledge base directly to verify ingestion and relevance:

curl -X POST http://localhost:8080/api/v1/knowledge-bases/kb_01J.../search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How do I reset my password?",
    "top_k": 5
  }'

Response:

{
  "data": {
    "results": [
      {
        "content": "To reset your password, navigate to Settings > Security > Change Password...",
        "score": 0.92,
        "metadata": {
          "source": "product-manual.pdf",
          "page": 42,
          "chunk_id": "chunk_001"
        }
      }
    ]
  }
}
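One practical way to verify relevance is to filter search results by score before trusting them. The sketch below works over the response shape shown above; the 0.8 threshold and the second (low-scoring) result are arbitrary examples, not API behaviour:

```python
# Response shape as returned by the search endpoint (second result invented
# here to illustrate filtering).
response = {
    "data": {
        "results": [
            {"content": "To reset your password, navigate to Settings...",
             "score": 0.92,
             "metadata": {"source": "product-manual.pdf", "page": 42}},
            {"content": "Unrelated section about billing...",
             "score": 0.41,
             "metadata": {"source": "product-manual.pdf", "page": 7}},
        ]
    }
}

# Keep only results above an (arbitrary) relevance threshold.
relevant = [r for r in response["data"]["results"] if r["score"] >= 0.8]
print(len(relevant), relevant[0]["metadata"]["page"])  # 1 42
```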

List Knowledge Bases

curl http://localhost:8080/api/v1/knowledge-bases

List Documents in a Knowledge Base

curl http://localhost:8080/api/v1/knowledge-bases/kb_01J.../documents

Get Knowledge Base Details

curl http://localhost:8080/api/v1/knowledge-bases/kb_01J...

Tip

When attaching a knowledge base to an agent, the agent automatically uses retrieval-augmented generation. The RAG scorers (retrieval_relevance, retrieval_groundedness, retrieval_sufficiency) become relevant for evaluations; use the HIGH risk profile to include them.

Note

Maree is a modular system. Each stage (Source, Processor, Transformer, Store) can be replaced with custom implementations. The pluggable architecture supports adding new connectors and processors without changing the pipeline core.