ADR-019: RAG Pipeline Design
Status
Implemented
Date
2025-01-16 (Retrospective)
Decision Makers
- AI/ML Team - RAG architecture
- Architecture Team - Integration patterns
Layer
AI-ML
Related ADRs
- ADR-017: Dual Embedding Strategy
- ADR-018: Semantic Search with pgvector
Supersedes
None
Depends On
- ADR-001: PostgreSQL with pgvector
- ADR-017: Dual Embedding Strategy
- ADR-018: Semantic Search with pgvector
Context
The platform needs AI-powered Q&A for documentation:
- Knowledge Base: Large documentation corpus
- Natural Questions: Users ask in natural language
- Accurate Answers: Responses grounded in actual docs
- Citations: Show source of information
- Context Limits: Retrieved context must fit within the LLM context window
Retrieval-Augmented Generation (RAG) addresses these by:
- Retrieving relevant context before generation
- Grounding responses in actual documents
- Providing citations to source material
Decision
We implement a RAG pipeline with the following architecture:
Key Design Decisions
- Chunking Strategy: Split documents into semantic chunks
- Vector Storage: pgvector for chunk embeddings
- Retrieval: Top-K similar chunks for query
- Prompt Template: Structured prompt with context
- Citation Tracking: Map responses to source chunks
Pipeline Architecture
Query → Embed → Retrieve → Rerank → Generate → Cite
1. Query Embedding: Convert question to vector
2. Retrieval: Find top-K similar chunks
3. Reranking: Score chunks for relevance
4. Generation: LLM with context window
5. Citation: Track which chunks were used
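The five stages above can be sketched end to end. This is a hypothetical outline, not the platform's implementation: `embed`, `vector_search`, `rerank`, and `llm_complete` are injected stand-ins for the real embedding, pgvector, reranker, and LLM calls.

```python
# Sketch of the Query → Embed → Retrieve → Rerank → Generate → Cite pipeline.
# The four callables are assumptions standing in for the real service clients.
from dataclasses import dataclass, field

@dataclass
class RAGAnswer:
    answer: str
    citations: list = field(default_factory=list)

def answer_question(question, embed, vector_search, rerank, llm_complete, top_k=5):
    query_vec = embed(question)                          # 1. Query Embedding
    candidates = vector_search(query_vec, k=top_k * 4)   # 2. Retrieval (over-fetch)
    ranked = rerank(question, candidates)[:top_k]        # 3. Reranking
    context = "\n\n".join(c["text"] for c in ranked)
    answer = llm_complete(context=context, question=question)  # 4. Generation
    return RAGAnswer(answer=answer,
                     citations=[c["chunk_id"] for c in ranked])  # 5. Citation
```

Over-fetching at the retrieval stage (here 4× top-K) gives the reranker a wider candidate pool to score, which is what makes stage 3 worthwhile.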
Chunking Configuration
# config/rag-config.yaml
chunking:
  strategy: semantic                      # semantic | fixed | sentence
  max_chunk_size: 1000                    # tokens
  overlap: 100                            # tokens
  separators: ["\n\n", "\n", ". ", " "]
API Endpoints
POST /api/rag/ask
{
  "question": "How do I configure OAuth?",
  "max_results": 5,
  "include_sources": true
}
Response:
{
  "answer": "To configure OAuth, you need to...",
  "sources": [
    {
      "chunk_id": "doc-001-chunk-5",
      "document": "Authentication Guide",
      "relevance": 0.92,
      "text": "OAuth configuration requires..."
    }
  ],
  "model_used": "gpt-4-turbo"
}
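A minimal client-side sketch of the request/response contract above, using only the standard library. The helper names are illustrative, not part of the API:

```python
import json

def build_ask_payload(question, max_results=5, include_sources=True):
    """Construct the JSON body for POST /api/rag/ask."""
    return json.dumps({
        "question": question,
        "max_results": max_results,
        "include_sources": include_sources,
    })

def extract_citations(response_body):
    """Pull (document, relevance) pairs from a /api/rag/ask response."""
    data = json.loads(response_body)
    return [(s["document"], s["relevance"]) for s in data.get("sources", [])]
```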
Prompt Template
PROMPT_TEMPLATE = """
You are a helpful assistant for the SRE Operations Platform.
Answer the question based ONLY on the following context.
If the answer is not in the context, say "I don't have enough information."
Context:
{context}
Question: {question}
Answer:
"""
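Assembling the template means joining retrieved chunks under a context budget. A hedged sketch, assuming a character budget as a crude proxy for the token budget (a real implementation would count tokens):

```python
PROMPT_TEMPLATE = """\
You are a helpful assistant for the SRE Operations Platform.
Answer the question based ONLY on the following context.
If the answer is not in the context, say "I don't have enough information."
Context:
{context}
Question: {question}
Answer:
"""

def build_prompt(chunks, question, max_context_chars=8000):
    """Join retrieved chunks into the template, stopping at the budget.

    max_context_chars is an assumed stand-in for a real token budget.
    """
    context = ""
    for chunk in chunks:
        if len(context) + len(chunk) > max_context_chars:
            break  # drop lower-ranked chunks rather than truncate mid-chunk
        context += chunk + "\n\n"
    return PROMPT_TEMPLATE.format(context=context.strip(), question=question)
```

Dropping whole lower-ranked chunks (rather than truncating mid-chunk) keeps every cited chunk fully present in the context the LLM actually saw.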
Consequences
Positive
- Accurate Answers: Grounded in actual documentation
- Reduced Hallucination: LLM constrained to context
- Transparent Sources: Users can verify answers
- Scalable Knowledge: Add docs without retraining
- Cost Effective: Only relevant context sent to LLM
Negative
- Retrieval Quality: Bad retrieval = bad answers
- Context Limits: Large retrieved contexts may be truncated
- Latency: Multiple steps add response time
- Chunking Artifacts: May split related content
Neutral
- LLM Dependency: Requires external LLM API
- Embedding Updates: New docs need embedding
Alternatives Considered
1. Fine-tuned LLM
- Approach: Train model on documentation
- Rejected: Expensive, outdated quickly, can't cite
2. Full Document Context
- Approach: Send entire docs to LLM
- Rejected: Exceeds context limits, expensive
3. Keyword Search + LLM
- Approach: Use traditional search for retrieval
- Rejected: Misses semantic matches
Implementation Status
- Core implementation complete
- Tests written and passing
- Documentation updated
- Migration/upgrade path defined
- Monitoring/observability in place
Implementation Details
- RAG Service: backend/services/rag_service.py
- Chunking: backend/services/document_chunker.py
- API Endpoints: backend/api/v1/rag.py
- Configuration: config/rag-config.yaml
- Docs: docs/development/ai-ml-features-guide.md
Compliance/Validation
- Automated checks: Answer quality evaluation
- Manual review: Sample Q&A reviewed weekly
- Metrics: Retrieval precision, answer latency
LLM Council Review
Review Date: 2025-01-16
Confidence Level: High (95%)
Verdict: CONDITIONAL APPROVAL
Quality Metrics
- Consensus Strength Score (CSS): 0.88
- Deliberation Depth Index (DDI): 0.90
Council Feedback Summary
The architecture is a solid foundation but requires significant tuning for SRE workflows. The reranking stage and prompt constraints are praised, but the chunking strategy and pure vector retrieval are ill-suited to high-precision SRE documentation.
Key Concerns Identified:
- Semantic Chunking Inappropriate for SRE Content: Breaks code blocks mid-syntax, separates "Symptom" from "Remediation"
- Pure Vector Retrieval Fails on Exact Matches: Error codes, IP addresses, CLI flags poorly handled by embeddings
- Dangerous Command Risk: LLM might hallucinate or retrieve destructive commands without context
- Staleness: Old runbooks are dangerous; retrieval should penalize older documents
Required Modifications:
- Structure-Aware Chunking:
- Split by Markdown headers (H2/H3) to keep procedures intact
- Keep code blocks and YAML atomic (never split mid-block)
- Lower indexing chunk size to 300-600 tokens for precision
- Use Parent-Child Indexing: Search on small chunks, feed surrounding context to LLM
- Adopt Hybrid Search:
- Add BM25 (keyword search) alongside pgvector
- Use Reciprocal Rank Fusion (RRF) for merging
- Logic-Layer Guardrails:
  - Dangerous Command Detection: Flag rm -rf, DROP TABLE, etc.
  - Answerability Gating: Return "I don't know" if the reranker score is below threshold
- Operational Context Injection: Use metadata (Service, Timestamp, Environment) as pre-filtering criteria
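The Reciprocal Rank Fusion step required above can be sketched directly from its definition. Only document IDs and ranks are needed, which is why RRF merges BM25 and vector results without score normalization; `k=60` is the commonly used smoothing constant, assumed here:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked result lists (e.g. BM25 and pgvector) via RRF.

    Each input list is an ordered sequence of document IDs, best first.
    score(d) = sum over lists of 1 / (k + rank of d in that list).
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked highly by both retrievers accumulates two large terms and rises to the top, which is exactly the behavior wanted when fusing keyword and semantic results.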
Modifications Applied
- Documented structure-aware chunking strategy
- Added hybrid search (BM25 + vector) requirement
- Documented dangerous command detection
- Added metadata pre-filtering recommendation
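The dangerous command detection applied above can be as simple as a regex deny-list scanned over the generated answer. The pattern set here is an assumed illustration; the real list would live in configuration and be reviewed by the SRE team:

```python
import re

# Assumed deny-list for illustration; the production set lives in config.
DANGEROUS_PATTERNS = [
    r"\brm\s+-rf\b",
    r"\bDROP\s+TABLE\b",
    r"\bDELETE\s+FROM\b(?!.*\bWHERE\b)",  # unscoped SQL delete
    r"\bmkfs\b",
    r"\bshutdown\b",
]

def find_dangerous_commands(text):
    """Return the deny-list patterns that match the answer text."""
    return [p for p in DANGEROUS_PATTERNS
            if re.search(p, text, re.IGNORECASE)]
```

A non-empty result would route the answer to a warning banner or human review rather than blocking outright, since runbooks legitimately discuss destructive commands.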
Council Ranking
- gemini-3-pro: Best Response (chunking analysis)
- claude-opus-4.5: Strong (guardrails focus)
- gpt-5.2: Good (hybrid search)
Operational Guidelines (APPROVED_WITH_MODS)
Chunking Strategy Documentation
Chunking Approaches by Document Type:
| Document Type | Strategy | Chunk Size | Overlap | Notes |
|---|---|---|---|---|
| Runbooks | Semantic (by section) | 500-1000 tokens | 50 | Preserve step boundaries |
| Requirements | Sentence-based | 200-400 tokens | 20 | Keep context |
| API Docs | Structure-aware | 300-600 tokens | 30 | Respect headers |
| Markdown | Heading-based | 400-800 tokens | 40 | Split at ## |
| Code | Function/class | Variable | 0 | Complete units |
Implementation:
# backend/services/rag/chunking.py
import re

from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,
)

class ChunkingStrategy:
    """Document-aware chunking with semantic boundaries."""

    def chunk_document(self, content: str, doc_type: str) -> list[Chunk]:
        if doc_type == "markdown":
            return self._chunk_markdown(content)
        elif doc_type == "runbook":
            return self._chunk_runbook(content)
        elif doc_type == "code":
            return self._chunk_code(content)
        else:
            return self._chunk_generic(content)

    def _chunk_markdown(self, content: str) -> list[Chunk]:
        # Split by headers first
        headers_splitter = MarkdownHeaderTextSplitter(
            headers_to_split_on=[
                ("#", "h1"),
                ("##", "h2"),
                ("###", "h3"),
            ]
        )
        header_chunks = headers_splitter.split_text(content)
        # Then split large sections
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=600,
            chunk_overlap=40,
            separators=["\n\n", "\n", ". ", " "],
        )
        final_chunks = []
        for chunk in header_chunks:
            if len(chunk.page_content) > 600:
                sub_chunks = text_splitter.split_text(chunk.page_content)
                for i, sub in enumerate(sub_chunks):
                    final_chunks.append(Chunk(
                        content=sub,
                        metadata={**chunk.metadata, "part": i},
                    ))
            else:
                # Wrap the langchain Document in our Chunk model
                final_chunks.append(Chunk(
                    content=chunk.page_content,
                    metadata=chunk.metadata,
                ))
        return final_chunks

    def _chunk_runbook(self, content: str) -> list[Chunk]:
        """Preserve runbook step boundaries."""
        # Split on step markers (1., 2., Step 1, etc.)
        step_pattern = r'\n(?=\d+\.|Step \d+|##)'
        steps = re.split(step_pattern, content)
        return [
            Chunk(content=step.strip(), metadata={"step": i})
            for i, step in enumerate(steps)
            if step.strip()
        ]
Chunk Metadata:
from pydantic import BaseModel, Field

class Chunk(BaseModel):
    content: str
    metadata: dict = Field(default_factory=dict)

    # Required metadata
    source_id: str                # Document ID
    chunk_index: int              # Position in document
    total_chunks: int             # Total chunks in document

    # Optional metadata
    section: str | None = None    # Section header
    page: int | None = None       # Page number (PDFs)
    code_type: str | None = None  # Language (code files)
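The `chunk_index` and `total_chunks` fields are what enable the council's parent-child indexing recommendation: search matches a small chunk, then the surrounding siblings are pulled in before the text reaches the LLM. A minimal sketch, with `chunks_by_doc` as an assumed in-memory stand-in for a chunk-store lookup:

```python
def expand_with_neighbors(chunks_by_doc, hit):
    """Expand a matched chunk with its neighbors from the same document.

    Parent-child idea: search on small chunks, feed the surrounding
    context (previous and next chunk) to the LLM.
    chunks_by_doc maps source_id -> ordered list of chunk contents.
    """
    siblings = chunks_by_doc[hit["source_id"]]
    i = hit["chunk_index"]
    lo = max(0, i - 1)            # clamp at document start
    hi = min(len(siblings), i + 2)  # clamp at document end
    return "\n".join(siblings[lo:hi])
```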
Citation Accuracy Metrics
Metrics to Track:
| Metric | Description | Target |
|---|---|---|
| citation_precision | % of citations that are relevant | >90% |
| citation_recall | % of relevant sources cited | >85% |
| citation_freshness | Age of cited documents | <30 days avg |
| hallucination_rate | % of claims without citation | <5% |
Implementation:
# backend/services/rag/metrics.py
from prometheus_client import Counter, Histogram, Gauge
RAG_QUERIES = Counter(
    'rag_queries_total',
    'Total RAG queries',
    ['status'],
)

RAG_CITATIONS = Histogram(
    'rag_citations_per_response',
    'Number of citations per response',
    buckets=[0, 1, 2, 3, 5, 10],
)

RAG_RELEVANCE_SCORE = Histogram(
    'rag_citation_relevance_score',
    'Relevance score of citations',
    buckets=[0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
)

class RAGMetricsCollector:
    def track_response(self, query: str, response: RAGResponse):
        RAG_QUERIES.labels(status='success').inc()
        RAG_CITATIONS.observe(len(response.citations))
        for citation in response.citations:
            RAG_RELEVANCE_SCORE.observe(citation.relevance_score)

    def calculate_citation_precision(
        self,
        citations: list[Citation],
        ground_truth: list[str],
    ) -> float:
        """Calculate precision: relevant citations / total citations."""
        if not citations:
            return 0.0
        relevant = sum(1 for c in citations if c.source_id in ground_truth)
        return relevant / len(citations)

    def calculate_citation_recall(
        self,
        citations: list[Citation],
        ground_truth: list[str],
    ) -> float:
        """Calculate recall: cited sources / all relevant sources."""
        if not ground_truth:
            return 1.0
        cited = set(c.source_id for c in citations)
        return len(cited & set(ground_truth)) / len(ground_truth)
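A small worked example of the precision/recall arithmetic above, stripped down to bare source IDs instead of Citation objects:

```python
def citation_precision(cited, relevant):
    """Fraction of cited sources that are actually relevant."""
    return len(set(cited) & set(relevant)) / len(cited) if cited else 0.0

def citation_recall(cited, relevant):
    """Fraction of relevant sources that were cited."""
    return len(set(cited) & set(relevant)) / len(relevant) if relevant else 1.0
```

With `cited = [doc-1, doc-2, doc-3]` and `relevant = [doc-1, doc-3, doc-4, doc-5]`, two of three citations are relevant (precision 0.67) and two of four relevant sources were cited (recall 0.5), below both the >90% and >85% targets.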
Citation Validation:
class CitationValidator:
    """Validate citations for accuracy and relevance."""

    def validate_response(self, response: RAGResponse) -> ValidationResult:
        issues = []
        for citation in response.citations:
            # Check source exists
            if not self._source_exists(citation.source_id):
                issues.append(f"Source not found: {citation.source_id}")
            # Check quote accuracy
            if citation.quote:
                if not self._verify_quote(citation.source_id, citation.quote):
                    issues.append(f"Quote not found in source: {citation.quote[:50]}...")
            # Check relevance
            if citation.relevance_score < 0.5:
                issues.append(f"Low relevance citation: {citation.source_id}")
        return ValidationResult(
            valid=len(issues) == 0,
            issues=issues,
            precision=self._calculate_precision(response),
        )
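Alongside citation validation, the Answerability Gating guardrail from the council review can be sketched as a final check before an answer is returned. The threshold value is an assumption to be tuned against the weekly Q&A evaluation set:

```python
ANSWERABILITY_THRESHOLD = 0.5  # assumed; tune against the evaluation set

def gate_answer(reranked, answer, threshold=ANSWERABILITY_THRESHOLD):
    """Return the fallback response when no retrieved chunk clears the
    reranker-score threshold (Answerability Gating)."""
    best = max((c["score"] for c in reranked), default=0.0)
    if best < threshold:
        return "I don't have enough information."
    return answer
```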
References
ADR-019 | AI-ML Layer | Implemented | APPROVED_WITH_MODS