ADR-019: RAG Pipeline Design

Status

Implemented

Date

2025-01-16 (Retrospective)

Decision Makers

  • AI/ML Team - RAG architecture
  • Architecture Team - Integration patterns

Layer

AI-ML

Related

  • ADR-017: Dual Embedding Strategy
  • ADR-018: Semantic Search with pgvector

Supersedes

None

Depends On

  • ADR-001: PostgreSQL with pgvector
  • ADR-017: Dual Embedding Strategy
  • ADR-018: Semantic Search with pgvector

Context

The platform needs AI-powered Q&A for documentation:

  1. Knowledge Base: Large documentation corpus
  2. Natural Questions: Users ask in natural language
  3. Accurate Answers: Responses grounded in actual docs
  4. Citations: Show source of information
  5. Context Limits: Answers must fit within the LLM's context window

Retrieval-Augmented Generation (RAG) addresses these by:

  • Retrieving relevant context before generation
  • Grounding responses in actual documents
  • Providing citations to source material

Decision

We implement a RAG pipeline with the following architecture:

Key Design Decisions

  1. Chunking Strategy: Split documents into semantic chunks
  2. Vector Storage: pgvector for chunk embeddings
  3. Retrieval: Top-K similar chunks for query
  4. Prompt Template: Structured prompt with context
  5. Citation Tracking: Map responses to source chunks

Pipeline Architecture

Query → Embed → Retrieve → Rerank → Generate → Cite

1. Query Embedding: Convert question to vector
2. Retrieval: Find top-K similar chunks
3. Reranking: Score chunks for relevance
4. Generation: LLM with context window
5. Citation: Track which chunks were used
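
The five steps above can be sketched as one orchestration function. This is a minimal illustration, not the platform's actual service code; `embed_query`, `retrieve_chunks`, `rerank`, and `generate_answer` are hypothetical callables standing in for the real implementations:

```python
from dataclasses import dataclass, field

@dataclass
class RAGResult:
    answer: str
    sources: list = field(default_factory=list)

def run_pipeline(question, embed_query, retrieve_chunks, rerank,
                 generate_answer, top_k=5):
    """Sketch of the Query → Embed → Retrieve → Rerank → Generate → Cite flow."""
    query_vec = embed_query(question)                     # 1. Query Embedding
    candidates = retrieve_chunks(query_vec, k=top_k * 4)  # 2. Retrieval (over-fetch)
    ranked = rerank(question, candidates)[:top_k]         # 3. Reranking
    answer = generate_answer(question, ranked)            # 4. Generation
    return RAGResult(answer=answer, sources=ranked)       # 5. Citation tracking
```

Over-fetching in step 2 gives the reranker a wider candidate pool than the final top-K; the 4x factor is an illustrative choice.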

Chunking Configuration

# config/rag-config.yaml
chunking:
  strategy: semantic      # semantic | fixed | sentence
  max_chunk_size: 1000    # tokens
  overlap: 100            # tokens
  separators: ["\n\n", "\n", ". ", " "]

API Endpoints

POST /api/rag/ask
{
  "question": "How do I configure OAuth?",
  "max_results": 5,
  "include_sources": true
}

Response:
{
  "answer": "To configure OAuth, you need to...",
  "sources": [
    {
      "chunk_id": "doc-001-chunk-5",
      "document": "Authentication Guide",
      "relevance": 0.92,
      "text": "OAuth configuration requires..."
    }
  ],
  "model_used": "gpt-4-turbo"
}
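
A client consuming this endpoint would serialize the request body and pull citations out of the response; a minimal sketch (the helper names are hypothetical, and the field names follow the request/response shapes shown above):

```python
import json

def build_ask_payload(question: str, max_results: int = 5) -> str:
    """Serialize the request body for POST /api/rag/ask."""
    return json.dumps({
        "question": question,
        "max_results": max_results,
        "include_sources": True,
    })

def extract_citations(response_json: str) -> list[tuple[str, float]]:
    """Pull (document, relevance) pairs out of an /api/rag/ask response."""
    data = json.loads(response_json)
    return [(s["document"], s["relevance"]) for s in data.get("sources", [])]
```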

Prompt Template

PROMPT_TEMPLATE = """
You are a helpful assistant for the SRE Operations Platform.
Answer the question based ONLY on the following context.
If the answer is not in the context, say "I don't have enough information."

Context:
{context}

Question: {question}

Answer:
"""
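
At generation time the retrieved chunks have to be flattened into the {context} slot. A minimal sketch; tagging each chunk with its ID so the model can reference sources is an assumption here, not something the ADR specifies:

```python
def build_prompt(question: str, chunks: list[dict], template: str) -> str:
    """Fill a RAG prompt template with ID-tagged context chunks."""
    context = "\n\n".join(
        f"[{c['chunk_id']}] {c['text']}" for c in chunks
    )
    return template.format(context=context, question=question)
```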

Consequences

Positive

  • Accurate Answers: Grounded in actual documentation
  • Reduced Hallucination: LLM constrained to context
  • Transparent Sources: Users can verify answers
  • Scalable Knowledge: Add docs without retraining
  • Cost Effective: Only relevant context sent to LLM

Negative

  • Retrieval Quality: Bad retrieval = bad answers
  • Context Limits: Large retrieved contexts may be truncated
  • Latency: Multiple steps add response time
  • Chunking Artifacts: May split related content

Neutral

  • LLM Dependency: Requires external LLM API
  • Embedding Updates: New docs need embedding

Alternatives Considered

1. Fine-tuned LLM

  • Approach: Train model on documentation
  • Rejected: Expensive, outdated quickly, can't cite

2. Full Document Context

  • Approach: Send entire docs to LLM
  • Rejected: Exceeds context limits, expensive

3. Keyword Search + LLM

  • Approach: Use traditional search for retrieval
  • Rejected: Misses semantic matches

Implementation Status

  • Core implementation complete
  • Tests written and passing
  • Documentation updated
  • Migration/upgrade path defined
  • Monitoring/observability in place

Implementation Details

  • RAG Service: backend/services/rag_service.py
  • Chunking: backend/services/document_chunker.py
  • API Endpoints: backend/api/v1/rag.py
  • Configuration: config/rag-config.yaml
  • Docs: docs/development/ai-ml-features-guide.md

Compliance/Validation

  • Automated checks: Answer quality evaluation
  • Manual review: Sample Q&A reviewed weekly
  • Metrics: Retrieval precision, answer latency

LLM Council Review

Review Date: 2025-01-16
Confidence Level: High (95%)
Verdict: CONDITIONAL APPROVAL

Quality Metrics

  • Consensus Strength Score (CSS): 0.88
  • Deliberation Depth Index (DDI): 0.90

Council Feedback Summary

The council judged the foundational architecture solid but in need of significant tuning for SRE workflows: the reranking step and prompt constraints were praised, while the semantic chunking strategy and pure vector retrieval were deemed ill-suited to high-precision SRE documentation.

Key Concerns Identified:

  1. Semantic Chunking Inappropriate for SRE Content: Breaks code blocks mid-syntax, separates "Symptom" from "Remediation"
  2. Pure Vector Retrieval Fails on Exact Matches: Error codes, IP addresses, CLI flags poorly handled by embeddings
  3. Dangerous Command Risk: LLM might hallucinate or retrieve destructive commands without context
  4. Staleness: Old runbooks are dangerous; retrieval should penalize older documents

Required Modifications:

  1. Structure-Aware Chunking:
    • Split by Markdown headers (H2/H3) to keep procedures intact
    • Keep code blocks and YAML atomic (never split mid-block)
    • Lower indexing chunk size to 300-600 tokens for precision
    • Use Parent-Child Indexing: Search on small chunks, feed surrounding context to LLM
  2. Adopt Hybrid Search:
    • Add BM25 (keyword search) alongside pgvector
    • Use Reciprocal Rank Fusion (RRF) for merging
  3. Logic-Layer Guardrails:
    • Dangerous Command Detection: Flag rm -rf, DROP TABLE, etc.
    • Answerability Gating: Return "I don't know" if reranker score below threshold
  4. Operational Context Injection: Use metadata (Service, Timestamp, Environment) as pre-filtering criteria
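
The Reciprocal Rank Fusion merge called for in modification 2 can be sketched as follows; k=60 is the constant commonly used in RRF implementations:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists (e.g. BM25 and pgvector results) by summed 1/(k + rank)."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked moderately well in both lists outscores one ranked first in only one list, which is exactly the behavior wanted when merging keyword and vector results.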

Modifications Applied

  1. Documented structure-aware chunking strategy
  2. Added hybrid search (BM25 + vector) requirement
  3. Documented dangerous command detection
  4. Added metadata pre-filtering recommendation
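
The dangerous command detection documented above could start as a pattern scan over retrieved chunks and generated answers. The pattern list below is illustrative only; a production list would be curated for the platform:

```python
import re

# Illustrative patterns only, not an exhaustive blocklist.
DANGEROUS_PATTERNS = [
    r"\brm\s+-[a-z]*r[a-z]*f",          # rm -rf and flag-order variants
    r"\bDROP\s+TABLE\b",
    r"\bTRUNCATE\s+TABLE\b",
    r"\bmkfs\.",                         # filesystem re-format
    r"\bkubectl\s+delete\s+namespace\b",
]

def find_dangerous_commands(text: str) -> list[str]:
    """Return the patterns that match, so callers can flag or annotate the answer."""
    return [p for p in DANGEROUS_PATTERNS if re.search(p, text, re.IGNORECASE)]
```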

Council Ranking

  • gemini-3-pro: Best Response (chunking analysis)
  • claude-opus-4.5: Strong (guardrails focus)
  • gpt-5.2: Good (hybrid search)

Operational Guidelines (APPROVED_WITH_MODS)

Chunking Strategy Documentation

Chunking Approaches by Document Type:

Document Type   Strategy                Chunk Size        Overlap   Notes
Runbooks        Semantic (by section)   500-1000 tokens   50        Preserve step boundaries
Requirements    Sentence-based          200-400 tokens    20        Keep context
API Docs        Structure-aware         300-600 tokens    30        Respect headers
Markdown        Heading-based           400-800 tokens    40        Split at ##
Code            Function/class          Variable          0         Complete units

Implementation:

# backend/services/rag/chunking.py
import re

from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,
)

class ChunkingStrategy:
    """Document-aware chunking with semantic boundaries."""

    def chunk_document(self, content: str, doc_type: str) -> list[Chunk]:
        if doc_type == "markdown":
            return self._chunk_markdown(content)
        elif doc_type == "runbook":
            return self._chunk_runbook(content)
        elif doc_type == "code":
            return self._chunk_code(content)
        else:
            return self._chunk_generic(content)

    def _chunk_markdown(self, content: str) -> list[Chunk]:
        # Split by headers first
        headers_splitter = MarkdownHeaderTextSplitter(
            headers_to_split_on=[
                ("#", "h1"),
                ("##", "h2"),
                ("###", "h3"),
            ]
        )
        header_chunks = headers_splitter.split_text(content)

        # Then split large sections
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=600,
            chunk_overlap=40,
            separators=["\n\n", "\n", ". ", " "],
        )

        final_chunks = []
        for chunk in header_chunks:
            if len(chunk.page_content) > 600:
                sub_chunks = text_splitter.split_text(chunk.page_content)
                for i, sub in enumerate(sub_chunks):
                    final_chunks.append(Chunk(
                        content=sub,
                        metadata={**chunk.metadata, "part": i},
                    ))
            else:
                # Wrap the langchain Document in our Chunk model
                final_chunks.append(Chunk(
                    content=chunk.page_content,
                    metadata=chunk.metadata,
                ))

        return final_chunks

    def _chunk_runbook(self, content: str) -> list[Chunk]:
        """Preserve runbook step boundaries."""
        # Split on step markers (1., 2., Step 1, etc.)
        step_pattern = r'\n(?=\d+\.|Step \d+|##)'
        steps = re.split(step_pattern, content)

        return [
            Chunk(content=step.strip(), metadata={"step": i})
            for i, step in enumerate(steps)
            if step.strip()
        ]

Chunk Metadata:

from pydantic import BaseModel, Field

class Chunk(BaseModel):
    content: str
    metadata: dict = Field(default_factory=dict)

    # Required metadata
    source_id: str       # Document ID
    chunk_index: int     # Position in document
    total_chunks: int    # Total chunks in document

    # Optional metadata
    section: str | None = None    # Section header
    page: int | None = None       # Page number (PDFs)
    code_type: str | None = None  # Language (code files)

Citation Accuracy Metrics

Metrics to Track:

Metric               Description                        Target
citation_precision   % of citations that are relevant   >90%
citation_recall      % of relevant sources cited        >85%
citation_freshness   Age of cited documents             <30 days avg
hallucination_rate   % of claims without citation       <5%

Implementation:

# backend/services/rag/metrics.py
from prometheus_client import Counter, Histogram

RAG_QUERIES = Counter(
    'rag_queries_total',
    'Total RAG queries',
    ['status']
)

RAG_CITATIONS = Histogram(
    'rag_citations_per_response',
    'Number of citations per response',
    buckets=[0, 1, 2, 3, 5, 10]
)

RAG_RELEVANCE_SCORE = Histogram(
    'rag_citation_relevance_score',
    'Relevance score of citations',
    buckets=[0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
)

class RAGMetricsCollector:
    def track_response(self, query: str, response: RAGResponse):
        RAG_QUERIES.labels(status='success').inc()
        RAG_CITATIONS.observe(len(response.citations))

        for citation in response.citations:
            RAG_RELEVANCE_SCORE.observe(citation.relevance_score)

    def calculate_citation_precision(
        self,
        citations: list[Citation],
        ground_truth: list[str]
    ) -> float:
        """Calculate precision: relevant citations / total citations."""
        if not citations:
            return 0.0
        relevant = sum(1 for c in citations if c.source_id in ground_truth)
        return relevant / len(citations)

    def calculate_citation_recall(
        self,
        citations: list[Citation],
        ground_truth: list[str]
    ) -> float:
        """Calculate recall: cited sources / all relevant sources."""
        if not ground_truth:
            return 1.0
        cited = set(c.source_id for c in citations)
        return len(cited & set(ground_truth)) / len(ground_truth)

Citation Validation:

class CitationValidator:
    """Validate citations for accuracy and relevance."""

    def validate_response(self, response: RAGResponse) -> ValidationResult:
        issues = []

        for citation in response.citations:
            # Check source exists
            if not self._source_exists(citation.source_id):
                issues.append(f"Source not found: {citation.source_id}")

            # Check quote accuracy
            if citation.quote:
                if not self._verify_quote(citation.source_id, citation.quote):
                    issues.append(f"Quote not found in source: {citation.quote[:50]}...")

            # Check relevance
            if citation.relevance_score < 0.5:
                issues.append(f"Low relevance citation: {citation.source_id}")

        return ValidationResult(
            valid=len(issues) == 0,
            issues=issues,
            precision=self._calculate_precision(response),
        )
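
The `_verify_quote` helper is not defined in the ADR; one plausible implementation (an assumption, not the actual service code) is a whitespace- and case-normalized substring check, which tolerates line wrapping in the source document:

```python
import re

def _normalize(s: str) -> str:
    """Collapse all whitespace runs to single spaces and lowercase."""
    return re.sub(r"\s+", " ", s).strip().lower()

def verify_quote(source_text: str, quote: str) -> bool:
    """Check that the quote appears verbatim (modulo whitespace) in the source."""
    return _normalize(quote) in _normalize(source_text)
```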

ADR-019 | AI-ML Layer | Implemented | APPROVED_WITH_MODS Completed