ADR-019: RAG Pipeline Design

Status

Implemented

Date

2025-01-16 (Retrospective)

Decision Makers

  • AI/ML Team - RAG architecture
  • Architecture Team - Integration patterns

Layer

AI-ML

Related

  • ADR-017: Dual Embedding Strategy
  • ADR-018: Semantic Search with pgvector

Supersedes

None

Depends On

  • ADR-001: PostgreSQL with pgvector
  • ADR-017: Dual Embedding Strategy
  • ADR-018: Semantic Search with pgvector

Context

The platform needs AI-powered Q&A for documentation:

  1. Knowledge Base: Large documentation corpus
  2. Natural Questions: Users ask in natural language
  3. Accurate Answers: Responses grounded in actual docs
  4. Citations: Show source of information
  5. Context Limits: Answers must fit within the LLM's context window

Retrieval-Augmented Generation (RAG) addresses these by:

  • Retrieving relevant context before generation
  • Grounding responses in actual documents
  • Providing citations to source material

Decision

We implement a RAG pipeline with the following architecture:

Key Design Decisions

  1. Chunking Strategy: Split documents into semantic chunks
  2. Vector Storage: pgvector for chunk embeddings
  3. Retrieval: Top-K similar chunks for query
  4. Prompt Template: Structured prompt with context
  5. Citation Tracking: Map responses to source chunks

Pipeline Architecture

Query → Embed → Retrieve → Rerank → Generate → Cite

1. Query Embedding: Convert question to vector
2. Retrieval: Find top-K similar chunks
3. Reranking: Score chunks for relevance
4. Generation: LLM with context window
5. Citation: Track which chunks were used
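
The five steps above can be sketched as one orchestration function. This is a minimal illustration, not the platform's actual service code; `embed_query`, `retrieve_chunks`, `rerank`, and `generate_answer` are hypothetical callables standing in for the real implementations:

```python
from dataclasses import dataclass, field

@dataclass
class RAGResult:
    answer: str
    sources: list = field(default_factory=list)

def run_pipeline(question, embed_query, retrieve_chunks, rerank,
                 generate_answer, top_k=5):
    """Sketch of the Query → Embed → Retrieve → Rerank → Generate → Cite flow."""
    query_vec = embed_query(question)                     # 1. Query Embedding
    candidates = retrieve_chunks(query_vec, k=top_k * 4)  # 2. Retrieval (over-fetch)
    ranked = rerank(question, candidates)[:top_k]         # 3. Reranking
    answer = generate_answer(question, ranked)            # 4. Generation
    return RAGResult(answer=answer, sources=ranked)       # 5. Citation tracking
```

Over-fetching in step 2 gives the reranker a wider candidate pool than the final top-K; the 4x factor is an illustrative choice.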

Chunking Configuration

# config/rag-config.yaml
chunking:
  strategy: semantic      # semantic | fixed | sentence
  max_chunk_size: 1000    # tokens
  overlap: 100            # tokens
  separators: ["\n\n", "\n", ". ", " "]

API Endpoints

POST /api/rag/ask
{
  "question": "How do I configure OAuth?",
  "max_results": 5,
  "include_sources": true
}

Response:
{
  "answer": "To configure OAuth, you need to...",
  "sources": [
    {
      "chunk_id": "doc-001-chunk-5",
      "document": "Authentication Guide",
      "relevance": 0.92,
      "text": "OAuth configuration requires..."
    }
  ],
  "model_used": "gpt-4-turbo"
}
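
A client consuming this endpoint would serialize the request body and pull citations out of the response; a minimal sketch (the helper names are hypothetical, and the field names follow the request/response shapes shown above):

```python
import json

def build_ask_payload(question: str, max_results: int = 5) -> str:
    """Serialize the request body for POST /api/rag/ask."""
    return json.dumps({
        "question": question,
        "max_results": max_results,
        "include_sources": True,
    })

def extract_citations(response_json: str) -> list[tuple[str, float]]:
    """Pull (document, relevance) pairs out of an /api/rag/ask response."""
    data = json.loads(response_json)
    return [(s["document"], s["relevance"]) for s in data.get("sources", [])]
```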

Prompt Template

PROMPT_TEMPLATE = """
You are a helpful assistant for the SRE Operations Platform.
Answer the question based ONLY on the following context.
If the answer is not in the context, say "I don't have enough information."

Context:
{context}

Question: {question}

Answer:
"""
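
At generation time the retrieved chunks have to be flattened into the {context} slot. A minimal sketch; tagging each chunk with its ID so the model can reference sources is an assumption here, not something the ADR specifies:

```python
def build_prompt(question: str, chunks: list[dict], template: str) -> str:
    """Fill a RAG prompt template with ID-tagged context chunks."""
    context = "\n\n".join(
        f"[{c['chunk_id']}] {c['text']}" for c in chunks
    )
    return template.format(context=context, question=question)
```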

Consequences

Positive

  • Accurate Answers: Grounded in actual documentation
  • Reduced Hallucination: LLM constrained to context
  • Transparent Sources: Users can verify answers
  • Scalable Knowledge: Add docs without retraining
  • Cost Effective: Only relevant context sent to LLM

Negative

  • Retrieval Quality: Bad retrieval = bad answers
  • Context Limits: Large retrieved contexts may be truncated
  • Latency: Multiple steps add response time
  • Chunking Artifacts: May split related content

Neutral

  • LLM Dependency: Requires external LLM API
  • Embedding Updates: New docs need embedding

Alternatives Considered

1. Fine-tuned LLM

  • Approach: Train model on documentation
  • Rejected: Expensive, outdated quickly, can't cite

2. Full Document Context

  • Approach: Send entire docs to LLM
  • Rejected: Exceeds context limits, expensive

3. Keyword Search + LLM

  • Approach: Use traditional search for retrieval
  • Rejected: Misses semantic matches

Implementation Status

  • Core implementation complete
  • Tests written and passing
  • Documentation updated
  • Migration/upgrade path defined
  • Monitoring/observability in place

Implementation Details

  • RAG Service: backend/services/rag_service.py
  • Chunking: backend/services/document_chunker.py
  • API Endpoints: backend/api/v1/rag.py
  • Configuration: config/rag-config.yaml
  • Docs: docs/development/ai-ml-features-guide.md

Compliance/Validation

  • Automated checks: Answer quality evaluation
  • Manual review: Sample Q&A reviewed weekly
  • Metrics: Retrieval precision, answer latency

LLM Council Review

Review Date: 2025-01-16
Confidence Level: High (95%)
Verdict: CONDITIONAL APPROVAL

Quality Metrics

  • Consensus Strength Score (CSS): 0.88
  • Deliberation Depth Index (DDI): 0.90

Council Feedback Summary

The council judged the foundational architecture solid but in need of significant tuning for SRE workflows: the reranking step and prompt constraints were praised, while the semantic chunking strategy and pure vector retrieval were deemed ill-suited to high-precision SRE documentation.

Key Concerns Identified:

  1. Semantic Chunking Inappropriate for SRE Content: Breaks code blocks mid-syntax, separates "Symptom" from "Remediation"
  2. Pure Vector Retrieval Fails on Exact Matches: Error codes, IP addresses, CLI flags poorly handled by embeddings
  3. Dangerous Command Risk: LLM might hallucinate or retrieve destructive commands without context
  4. Staleness: Old runbooks are dangerous; retrieval should penalize older documents

Required Modifications:

  1. Structure-Aware Chunking:
    • Split by Markdown headers (H2/H3) to keep procedures intact
    • Keep code blocks and YAML atomic (never split mid-block)
    • Lower indexing chunk size to 300-600 tokens for precision
    • Use Parent-Child Indexing: Search on small chunks, feed surrounding context to LLM
  2. Adopt Hybrid Search:
    • Add BM25 (keyword search) alongside pgvector
    • Use Reciprocal Rank Fusion (RRF) for merging
  3. Logic-Layer Guardrails:
    • Dangerous Command Detection: Flag rm -rf, DROP TABLE, etc.
    • Answerability Gating: Return "I don't know" if reranker score below threshold
  4. Operational Context Injection: Use metadata (Service, Timestamp, Environment) as pre-filtering criteria
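
The Reciprocal Rank Fusion merge called for in modification 2 can be sketched as follows; k=60 is the constant commonly used in RRF implementations:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists (e.g. BM25 and pgvector results) by summed 1/(k + rank)."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked moderately well in both lists outscores one ranked first in only one list, which is exactly the behavior wanted when merging keyword and vector results.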

Modifications Applied

  1. Documented structure-aware chunking strategy
  2. Added hybrid search (BM25 + vector) requirement
  3. Documented dangerous command detection
  4. Added metadata pre-filtering recommendation
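
The dangerous command detection documented above could start as a pattern scan over retrieved chunks and generated answers. The pattern list below is illustrative only; a production list would be curated for the platform:

```python
import re

# Illustrative patterns only, not an exhaustive blocklist.
DANGEROUS_PATTERNS = [
    r"\brm\s+-[a-z]*r[a-z]*f",          # rm -rf and flag-order variants
    r"\bDROP\s+TABLE\b",
    r"\bTRUNCATE\s+TABLE\b",
    r"\bmkfs\.",                         # filesystem re-format
    r"\bkubectl\s+delete\s+namespace\b",
]

def find_dangerous_commands(text: str) -> list[str]:
    """Return the patterns that match, so callers can flag or annotate the answer."""
    return [p for p in DANGEROUS_PATTERNS if re.search(p, text, re.IGNORECASE)]
```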

Council Ranking

  • gemini-3-pro: Best Response (chunking analysis)
  • claude-opus-4.5: Strong (guardrails focus)
  • gpt-5.2: Good (hybrid search)

Operational Guidelines (APPROVED_WITH_MODS)

Chunking Strategy Documentation

Chunking Approaches by Document Type:

Document Type   Strategy                Chunk Size        Overlap   Notes
Runbooks        Semantic (by section)   500-1000 tokens   50        Preserve step boundaries
Requirements    Sentence-based          200-400 tokens    20        Keep context
API Docs        Structure-aware         300-600 tokens    30        Respect headers
Markdown        Heading-based           400-800 tokens    40        Split at ##
Code            Function/class          Variable          0         Complete units

Implementation:

# backend/services/rag/chunking.py
import re

from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,
)

class ChunkingStrategy:
    """Document-aware chunking with semantic boundaries."""

    def chunk_document(self, content: str, doc_type: str) -> list[Chunk]:
        if doc_type == "markdown":
            return self._chunk_markdown(content)
        elif doc_type == "runbook":
            return self._chunk_runbook(content)
        elif doc_type == "code":
            return self._chunk_code(content)
        else:
            return self._chunk_generic(content)

    def _chunk_markdown(self, content: str) -> list[Chunk]:
        # Split by headers first
        headers_splitter = MarkdownHeaderTextSplitter(
            headers_to_split_on=[
                ("#", "h1"),
                ("##", "h2"),
                ("###", "h3"),
            ]
        )
        header_chunks = headers_splitter.split_text(content)

        # Then split large sections
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=600,
            chunk_overlap=40,
            separators=["\n\n", "\n", ". ", " "],
        )

        final_chunks = []
        for chunk in header_chunks:
            if len(chunk.page_content) > 600:
                sub_chunks = text_splitter.split_text(chunk.page_content)
                for i, sub in enumerate(sub_chunks):
                    final_chunks.append(Chunk(
                        content=sub,
                        metadata={**chunk.metadata, "part": i},
                    ))
            else:
                # Wrap the langchain Document in our Chunk model
                final_chunks.append(Chunk(
                    content=chunk.page_content,
                    metadata=chunk.metadata,
                ))

        return final_chunks

    def _chunk_runbook(self, content: str) -> list[Chunk]:
        """Preserve runbook step boundaries."""
        # Split on step markers (1., 2., Step 1, etc.)
        step_pattern = r'\n(?=\d+\.|Step \d+|##)'
        steps = re.split(step_pattern, content)

        return [
            Chunk(content=step.strip(), metadata={"step": i})
            for i, step in enumerate(steps)
            if step.strip()
        ]

Chunk Metadata:

from pydantic import BaseModel, Field

class Chunk(BaseModel):
    content: str
    metadata: dict = Field(default_factory=dict)

    # Required metadata
    source_id: str       # Document ID
    chunk_index: int     # Position in document
    total_chunks: int    # Total chunks in document

    # Optional metadata
    section: str | None = None    # Section header
    page: int | None = None       # Page number (PDFs)
    code_type: str | None = None  # Language (code files)

Citation Accuracy Metrics

Metrics to Track:

Metric               Description                        Target
citation_precision   % of citations that are relevant   >90%
citation_recall      % of relevant sources cited        >85%
citation_freshness   Age of cited documents             <30 days avg
hallucination_rate   % of claims without citation       <5%

Implementation:

# backend/services/rag/metrics.py
from prometheus_client import Counter, Histogram

RAG_QUERIES = Counter(
    'rag_queries_total',
    'Total RAG queries',
    ['status']
)

RAG_CITATIONS = Histogram(
    'rag_citations_per_response',
    'Number of citations per response',
    buckets=[0, 1, 2, 3, 5, 10]
)

RAG_RELEVANCE_SCORE = Histogram(
    'rag_citation_relevance_score',
    'Relevance score of citations',
    buckets=[0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
)

class RAGMetricsCollector:
    def track_response(self, query: str, response: RAGResponse):
        RAG_QUERIES.labels(status='success').inc()
        RAG_CITATIONS.observe(len(response.citations))

        for citation in response.citations:
            RAG_RELEVANCE_SCORE.observe(citation.relevance_score)

    def calculate_citation_precision(
        self,
        citations: list[Citation],
        ground_truth: list[str]
    ) -> float:
        """Calculate precision: relevant citations / total citations."""
        if not citations:
            return 0.0
        relevant = sum(1 for c in citations if c.source_id in ground_truth)
        return relevant / len(citations)

    def calculate_citation_recall(
        self,
        citations: list[Citation],
        ground_truth: list[str]
    ) -> float:
        """Calculate recall: cited sources / all relevant sources."""
        if not ground_truth:
            return 1.0
        cited = set(c.source_id for c in citations)
        return len(cited & set(ground_truth)) / len(ground_truth)

Citation Validation:

class CitationValidator:
    """Validate citations for accuracy and relevance."""

    def validate_response(self, response: RAGResponse) -> ValidationResult:
        issues = []

        for citation in response.citations:
            # Check source exists
            if not self._source_exists(citation.source_id):
                issues.append(f"Source not found: {citation.source_id}")

            # Check quote accuracy
            if citation.quote:
                if not self._verify_quote(citation.source_id, citation.quote):
                    issues.append(f"Quote not found in source: {citation.quote[:50]}...")

            # Check relevance
            if citation.relevance_score < 0.5:
                issues.append(f"Low relevance citation: {citation.source_id}")

        return ValidationResult(
            valid=len(issues) == 0,
            issues=issues,
            precision=self._calculate_precision(response),
        )
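
The `_verify_quote` helper is not defined in the ADR; one plausible implementation (an assumption, not the actual service code) is a whitespace- and case-normalized substring check, which tolerates line wrapping in the source document:

```python
import re

def _normalize(s: str) -> str:
    """Collapse all whitespace runs to single spaces and lowercase."""
    return re.sub(r"\s+", " ", s).strip().lower()

def verify_quote(source_text: str, quote: str) -> bool:
    """Check that the quote appears verbatim (modulo whitespace) in the source."""
    return _normalize(quote) in _normalize(source_text)
```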

ADR-019 | AI-ML Layer | Implemented | APPROVED_WITH_MODS Completed