v0.1.0 · Apache 2.0


Knowledge Bases

Create and manage knowledge bases with the Maree ingestion pipeline for RAG-powered agents.


Overview

Knowledge bases in Recif are powered by Maree, a pluggable ingestion pipeline that processes documents through a four-stage architecture. Documents are ingested into pgvector for vector search and attached to agents for retrieval-augmented generation (RAG).

Pipeline Architecture

Maree follows a four-stage pipeline:

Source --> Processor --> Transformer --> Store
| Stage | Purpose | Examples |
| --- | --- | --- |
| Source | Fetch raw content | Google Drive, Jira, Confluence, Databricks, S3, local files |
| Processor | Extract structured content | Docling (PDF, DOCX, HTML), CSV parser |
| Transformer | Generate embeddings | Ollama (nomic-embed-text), OpenAI embeddings |
| Store | Persist vectors | pgvector (PostgreSQL with vector extension) |
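The four-stage flow can be sketched in a few lines of Python. This is an illustrative stand-in, not the Maree API: every class and function name below is hypothetical, and the "embedding" is a toy two-number vector.

```python
# Minimal sketch of Maree's Source -> Processor -> Transformer -> Store flow.
# All names are illustrative, not actual Maree interfaces.

class LocalSource:
    """Source: fetch raw content (an in-memory stand-in for a file)."""
    def __init__(self, raw):
        self.raw = raw
    def fetch(self):
        return self.raw

class TextProcessor:
    """Processor: extract structured content (split into paragraphs)."""
    def process(self, raw):
        return [p.strip() for p in raw.split("\n\n") if p.strip()]

class FakeEmbedder:
    """Transformer: generate embeddings (a toy length-based vector)."""
    def embed(self, chunk):
        return [float(len(chunk)), float(chunk.count(" "))]

class MemoryStore:
    """Store: persist vectors (a list standing in for pgvector)."""
    def __init__(self):
        self.rows = []
    def persist(self, chunk, vector):
        self.rows.append({"content": chunk, "embedding": vector})

def ingest(source, processor, transformer, store):
    for chunk in processor.process(source.fetch()):
        store.persist(chunk, transformer.embed(chunk))
    return store

store = ingest(LocalSource("First paragraph.\n\nSecond paragraph."),
               TextProcessor(), FakeEmbedder(), MemoryStore())
print(len(store.rows))  # 2
```

Because each stage is a separate object with a single method, any one of them can be swapped out (a different Source, a real embedder) without touching the others, which is the point of the pluggable design.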

Create a Knowledge Base

curl -X POST http://localhost:8080/api/v1/knowledge-bases \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Product Documentation",
    "description": "All product docs for the support agent",
    "embedding_model": "nomic-embed-text",
    "chunk_size": 512,
    "chunk_overlap": 50
  }'

Response:

{
  "data": {
    "id": "kb_01J...",
    "name": "Product Documentation",
    "description": "All product docs for the support agent",
    "document_count": 0,
    "created_at": "2026-04-03T10:00:00Z"
  }
}

Ingest Documents

Supported Formats

| Format | Extension | Processor |
| --- | --- | --- |
| PDF | .pdf | Docling |
| Microsoft Word | .docx | Docling |
| HTML | .html | Docling |
| CSV | .csv | CSV parser |
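The format table above maps file extensions to processors; a dispatch based on it might look like the following sketch (the mapping values and function name are illustrative, not Maree identifiers):

```python
# Mirrors the supported-formats table; names are illustrative.
PROCESSOR_FOR_EXTENSION = {
    ".pdf": "docling",
    ".docx": "docling",
    ".html": "docling",
    ".csv": "csv-parser",
}

def pick_processor(path):
    """Choose a processor from the file extension (case-insensitive)."""
    ext = path[path.rfind("."):].lower()
    try:
        return PROCESSOR_FOR_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"unsupported format: {ext}")

print(pick_processor("/data/docs/product-manual.pdf"))  # docling
```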

Ingest via API

curl -X POST http://localhost:8080/api/v1/knowledge-bases/kb_01J.../ingest \
  -H "Content-Type: application/json" \
  -d '{
    "source": {
      "type": "local",
      "path": "/data/docs/product-manual.pdf"
    }
  }'

Ingest from a connector

# Google Drive
curl -X POST http://localhost:8080/api/v1/knowledge-bases/kb_01J.../ingest \
  -H "Content-Type: application/json" \
  -d '{
    "source": {
      "type": "google-drive",
      "folder_id": "1ABC...",
      "credentials_secret": "gdrive-credentials"
    }
  }'
 
# S3 bucket
curl -X POST http://localhost:8080/api/v1/knowledge-bases/kb_01J.../ingest \
  -H "Content-Type: application/json" \
  -d '{
    "source": {
      "type": "s3",
      "bucket": "my-docs-bucket",
      "prefix": "knowledge-base/"
    }
  }'
 
# Confluence
curl -X POST http://localhost:8080/api/v1/knowledge-bases/kb_01J.../ingest \
  -H "Content-Type: application/json" \
  -d '{
    "source": {
      "type": "confluence",
      "space_key": "DOCS",
      "base_url": "https://myorg.atlassian.net"
    }
  }'

Connectors

| Connector | Source Type | Authentication |
| --- | --- | --- |
| Google Drive | google-drive | Service account credentials (Secret) |
| Jira | jira | API token |
| Confluence | confluence | API token |
| Databricks | databricks | Personal access token |
| Amazon S3 | s3 | AWS credentials (IAM role or access key) |
| Local Files | local | Direct filesystem path |

Docling Extraction

Docling is the default document processor for PDF, DOCX, and HTML files. It handles:

  • Multi-page PDFs with layout analysis
  • Table extraction with structure preservation
  • Image OCR for embedded text
  • Header/footer removal
  • Section-aware chunking

The processor splits documents into chunks based on the configured chunk_size (default: 512 tokens) with chunk_overlap (default: 50 tokens) for context continuity.
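The chunking behaviour can be approximated with a short sketch. Note the simplification: "tokens" here are whatever list you pass in, whereas the real processor counts tokenizer tokens and chunks section-aware.

```python
def chunk_tokens(tokens, chunk_size=512, chunk_overlap=50):
    """Split a token list into overlapping chunks: each new chunk
    starts chunk_size - chunk_overlap tokens after the previous one,
    so consecutive chunks share chunk_overlap tokens of context."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
print(len(chunks))    # 3
print(chunks[1][0])   # tok462 -- overlaps the previous chunk by 50 tokens
```

With the defaults, a 1000-token document yields chunks covering tokens 0-511, 462-973, and 924-999: the 50-token overlap means a sentence split at a chunk boundary still appears whole in one of the two chunks.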

pgvector Storage

Knowledge base vectors are stored in PostgreSQL with the pgvector extension. The default Helm chart deploys pgvector/pgvector:pg16 with vector similarity search support.
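Similarity search ranks stored rows by vector distance; pgvector's cosine-distance operator (`<=>`) computes 1 minus cosine similarity. The ranking can be illustrated with a pure-Python stand-in, no database involved (the row names and vectors are made up):

```python
import math

def cosine_distance(a, b):
    """Same quantity as pgvector's <=> operator: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

query = [1.0, 0.0]
rows = {"doc_a": [1.0, 0.1], "doc_b": [0.0, 1.0], "doc_c": [0.9, 0.9]}

# Smallest distance first = most similar first.
ranked = sorted(rows, key=lambda k: cosine_distance(query, rows[k]))
print(ranked[0])  # doc_a
```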

Embedding model

The default embedding model is Ollama's nomic-embed-text, deployed alongside Recif. Configure alternative embedding models through the knowledge base settings.

| Embedding Provider | Model | Dimensions |
| --- | --- | --- |
| Ollama | nomic-embed-text | 768 |
| OpenAI | text-embedding-3-small | 1536 |
| Google | textembedding-gecko | 768 |

Attach a Knowledge Base to an Agent

Via CRD

apiVersion: recif.dev/v1
kind: Agent
metadata:
  name: support-agent
  namespace: team-default
spec:
  name: "Support Agent"
  framework: adk
  modelType: openai
  modelId: gpt-4
  strategy: agent-react
  knowledgeBases:
    - kb_01J...
    - kb_02K...

Via the Corail environment variable

CORAIL_KNOWLEDGE_BASES='[{"type":"pgvector","connection_url":"postgres://recif:recif_dev@recif-postgresql:5432/recif","kb_id":"kb_01J..."}]'
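The variable holds a JSON array with one entry per attached knowledge base, so it parses like any JSON value. A quick sketch (the example value below uses a placeholder password rather than the dev credentials):

```python
import json
import os

# Example value; in a real deployment this is set on the Corail container.
os.environ["CORAIL_KNOWLEDGE_BASES"] = json.dumps([{
    "type": "pgvector",
    "connection_url": "postgres://recif:<password>@recif-postgresql:5432/recif",
    "kb_id": "kb_01J...",
}])

kbs = json.loads(os.environ["CORAIL_KNOWLEDGE_BASES"])
for kb in kbs:
    print(kb["type"], kb["kb_id"])  # pgvector kb_01J...
```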

Search a Knowledge Base

Query the knowledge base directly to verify ingestion and relevance:

curl -X POST http://localhost:8080/api/v1/knowledge-bases/kb_01J.../search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How do I reset my password?",
    "top_k": 5
  }'

Response:

{
  "data": {
    "results": [
      {
        "content": "To reset your password, navigate to Settings > Security > Change Password...",
        "score": 0.92,
        "metadata": {
          "source": "product-manual.pdf",
          "page": 42,
          "chunk_id": "chunk_001"
        }
      }
    ]
  }
}
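One practical way to verify relevance is to filter search results by score before trusting them. The sketch below works over the response shape shown above; the 0.8 threshold and the second (low-scoring) result are arbitrary examples, not API behaviour:

```python
# Response shape as returned by the search endpoint (second result invented
# here to illustrate filtering).
response = {
    "data": {
        "results": [
            {"content": "To reset your password, navigate to Settings...",
             "score": 0.92,
             "metadata": {"source": "product-manual.pdf", "page": 42}},
            {"content": "Unrelated section about billing...",
             "score": 0.41,
             "metadata": {"source": "product-manual.pdf", "page": 7}},
        ]
    }
}

# Keep only results above an (arbitrary) relevance threshold.
relevant = [r for r in response["data"]["results"] if r["score"] >= 0.8]
print(len(relevant), relevant[0]["metadata"]["page"])  # 1 42
```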

List Knowledge Bases

curl http://localhost:8080/api/v1/knowledge-bases

List Documents in a Knowledge Base

curl http://localhost:8080/api/v1/knowledge-bases/kb_01J.../documents

Get Knowledge Base Details

curl http://localhost:8080/api/v1/knowledge-bases/kb_01J...

Tip

When attaching a knowledge base to an agent, the agent automatically uses retrieval-augmented generation. The RAG scorers (retrieval_relevance, retrieval_groundedness, retrieval_sufficiency) become relevant for evaluations; use the HIGH risk profile to include them.

Note

Maree is a modular system. Each stage (Source, Processor, Transformer, Store) can be replaced with custom implementations. The pluggable architecture supports adding new connectors and processors without changing the pipeline core.