HybridRAG: Two-Stage Retrieval for AI Memory

February 10, 2026 · 9 min read

Pure vector search isn't enough. Here's how we combine BM25, embeddings, and cross-encoder reranking to achieve accurate memory retrieval.

Vector search revolutionized information retrieval. Embed your documents, embed your query, find the nearest neighbors. Simple, elegant, and... often wrong.

The problem? Vector embeddings excel at semantic similarity but struggle with exact matches. Ask "What's the Redis configuration?" and pure vector search might return documents about "cache settings" or "in-memory databases" while missing the one that literally contains "Redis configuration."

HybridRAG solves this by combining the strengths of multiple retrieval methods, then using a neural reranker to pick the best results.

The Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         Query Input                              │
├─────────────────────────────────────────────────────────────────┤
│ Stage 1: Parallel Candidate Generation                          │
├──────────────────────┬──────────────────────────────────────────┤
│  BM25 Search         │  Vector Search                           │
│  (Sparse Keywords)   │  (Dense Semantics)                       │
│  Top 50              │  Top 50                                  │
└──────────────────────┴──────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│ Stage 2: Fusion + Reranking                                     │
├────────────────────────────┬────────────────────────────────────┤
│  RRF Fusion                │  Cross-Encoder Reranking           │
│  (Score-agnostic merge)    │  (Deep relevance scoring)          │
└────────────────────────────┴────────────────────────────────────┘
                              │
                              ▼
                      Final Results (Top K)

Two stages. Multiple signals. One answer.

Stage 1: Parallel Candidate Generation

Stage 1 runs two search methods in parallel, each returning 50 candidates.

BM25: The Keyword Hunter

BM25 (Best Match 25) is a probabilistic ranking function that scores documents based on term frequency and inverse document frequency. It's been the backbone of search engines for decades.

BM25(D, Q) = Σ IDF(qi) * (f(qi, D) * (k1 + 1)) / (f(qi, D) + k1 * (1 - b + b * |D| / avgdl))

Translation: Documents score higher when they contain rare query terms (high IDF) that appear frequently in the document (high TF), normalized by document length.

Why BM25 matters:

Query Type	BM25	Vector Search
"PostgreSQL 15.2 configuration"	Exact match on version	Semantic drift to "database setup"
"REDIS_CONNECTION_TIMEOUT"	Finds the config key	Returns cache-related docs
"ADR-003"	Direct hit	Might miss (ADR = rare term)

BM25 finds the needle. Vector search finds things that look like needles.

Vector Search: The Semantic Finder

Dense vector search embeds queries and documents into a shared embedding space (typically 384-1536 dimensions depending on the model), then finds nearest neighbors by cosine similarity.

# Using sentence-transformers (e.g., all-MiniLM-L6-v2 = 384 dims)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
query_embedding = model.encode("How should I configure caching?")
similarities = np.dot(doc_embeddings, query_embedding)
top_k = np.argsort(similarities)[-50:]

Why vectors matter:

Query	Relevant Document	Similarity
"How to speed up the app"	"Performance optimization guide"	0.89
"Authentication best practices"	"Securing user login flows"	0.87
"Reduce memory usage"	"Heap size tuning for JVM"	0.82

Vectors understand synonyms, paraphrasing, and conceptual relationships that keyword search misses.

Running in Parallel

Both searches are independent, so we run them concurrently:

async def stage1(query: str, user_id: str):
    bm25_task = asyncio.to_thread(self.bm25.search, user_id, query, 50)
    vector_task = asyncio.to_thread(self.vector.search, user_id, query, 50)

    bm25_results, vector_results = await asyncio.gather(bm25_task, vector_task)
    return bm25_results, vector_results

Stage 1 takes ~200-400ms (limited by vector embedding).

Stage 2: Fusion + Reranking

Now we have two ranked lists. How do we combine them?

The Naive Approach (Don't Do This)

Normalize scores and average:

# DON'T DO THIS
bm25_normalized = (bm25_score - min) / (max - min)
vector_normalized = (vector_score - min) / (max - min)
combined = (bm25_normalized + vector_normalized) / 2

This fails because:

Score distributions differ wildly (BM25: 0-20, vectors: 0-1)
Outliers dominate after normalization
Assumes scores are comparable (they're not)

Reciprocal Rank Fusion (RRF)

RRF ignores scores entirely and works with ranks:

RRF_score(d) = Σ 1 / (k + rank_i(d))

Where k=60 (from the original paper) and rank_i(d) is the 1-indexed position of document d in list i.

Example:

BM25:   [doc1, doc2, doc3]  (ranks 1, 2, 3)
Vector: [doc2, doc1, doc4]  (ranks 1, 2, 3)

RRF scores (k=60):
doc1: 1/(60+1) + 1/(60+2) = 0.0164 + 0.0161 = 0.0325
doc2: 1/(60+2) + 1/(60+1) = 0.0161 + 0.0164 = 0.0325
doc3: 1/(60+3)            = 0.0159
doc4: 1/(60+3)            = 0.0159

Both doc1 and doc2 tie because they appeared in both lists at similar ranks.

Why RRF works:

Score-agnostic: Works with any ranking function
Robust: Not sensitive to outliers or extreme scores
Empirically strong: Often competitive with learned fusion methods

Implementation:

from collections import defaultdict

def reciprocal_rank_fusion(
    *ranked_lists: list[tuple[str, float]],
    k: int = 60
) -> list[tuple[str, float]]:
    """
    Fuse multiple ranked lists using RRF.

    Args:
        ranked_lists: Each list contains (doc_id, score) tuples, sorted by score desc
        k: RRF constant (default 60 from original paper)

    Returns:
        Fused list of (doc_id, rrf_score) sorted by RRF score desc
    """
    scores = defaultdict(float)

    for ranked_list in ranked_lists:
        for rank, (doc_id, _) in enumerate(ranked_list, start=1):
            scores[doc_id] += 1 / (k + rank)

    return sorted(scores.items(), key=lambda x: -x[1])

# Usage
fused = reciprocal_rank_fusion(bm25_results, vector_results, k=60)

Cross-Encoder Reranking

RRF gives us a fused candidate list. But we can do better.

A cross-encoder is a neural model that takes (query, document) pairs and outputs a relevance score. Unlike bi-encoders (vector search), it sees both texts together, enabling deeper semantic understanding.

# Bi-encoder (separate encoding)
query_emb = model.encode(query)
doc_emb = model.encode(doc)
score = cosine(query_emb, doc_emb)  # Approximate

# Cross-encoder (joint encoding)
score = cross_encoder.predict([(query, doc)])  # Precise

Cross-encoders are significantly more accurate but much slower per-pair (the exact ratio depends on batching, hardware, and model size). That's why we only use them for reranking the top ~50 candidates, not the entire corpus.

Model choice: cross-encoder/ms-marco-MiniLM-L-6-v2

Trained on MS MARCO passage ranking dataset
Fast (~2ms per pair on CPU, faster with batching/GPU)
Good balance of speed and accuracy

from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query: str, candidates: list, top_k: int = 10):
    """Rerank candidates using cross-encoder relevance scoring."""
    pairs = [(query, doc.content) for doc in candidates]

    # Batch scoring for efficiency
    scores = cross_encoder.predict(pairs, batch_size=32)

    # Sort by cross-encoder score
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

Stage 2 takes ~50-100ms.

The Complete Flow

Let's trace a query through the system:

Query: "How should I configure database migrations?"

Stage 1a - BM25 (30ms):

Tokens: ["configure", "database", "migrations"]
Results: [("mem-5", 7.3), ("mem-1", 4.2), ("mem-8", 3.1), ...]

Stage 1b - Vector (200ms):

Embedding: [0.12, -0.34, 0.87, ...] (384 dims)
Results: [("mem-1", 0.92), ("mem-5", 0.87), ("mem-12", 0.78), ...]

Stage 2a - RRF Fusion (5ms):

doc  | bm25_rank | vec_rank | RRF score
-----|-----------|----------|----------
mem-1|     2     |    1     | 0.0325
mem-5|     1     |    2     | 0.0325
mem-8|     3     |    -     | 0.0159
mem-12|    -     |    3     | 0.0159

Stage 2b - Reranking (60ms):

Cross-encoder scores:
mem-1:  0.89 ← Best match (specific migration instructions)
mem-5:  0.87
mem-12: 0.72
mem-8:  0.65

Final Results:

[
    HybridResult(memory_id="mem-1", score=0.89,
                 source_ranks={"bm25": 2, "vector": 1, "reranker": 1}),
    HybridResult(memory_id="mem-5", score=0.87,
                 source_ranks={"bm25": 1, "vector": 2, "reranker": 2}),
    ...
]

Total latency: ~350ms.

Tuning the Weights

Not all retrieval methods are equal for all domains. You can weight them differently in RRF:

# For code/config (keyword-heavy)
retriever = HybridRetriever(bm25_weight=1.5, vector_weight=1.0)

# For conceptual/natural language
retriever = HybridRetriever(bm25_weight=1.0, vector_weight=1.5)

Weighted RRF:

score(d) = w_bm25/(k + rank_bm25(d)) + w_vec/(k + rank_vec(d))

Provenance Tracking

Every result tracks where it came from:

result = HybridResult(
    memory=Memory(...),
    score=0.89,
    memory_id="mem-1",
    source_scores={"bm25": 4.2, "vector": 0.92, "reranker": 0.89},
    source_ranks={"bm25": 2, "vector": 1, "reranker": 1}
)

This lets you debug retrieval issues:

High BM25, low vector → Keyword match, semantic mismatch
High vector, low BM25 → Semantic match, missing keywords
High both, low reranker → False positive (looks relevant but isn't)

Performance Characteristics

Stage	Component	Latency	Notes
1a	BM25 Search	10-50ms	O(n) scoring
1b	Vector Embed	50-200ms	Model inference
1b	Vector Search	20-100ms	Vectorized dot product
2a	RRF Fusion	5-20ms	O(n log n) sort
2b	Reranking	30-100ms	~2ms per candidate
Total		~350-500ms

For latency-sensitive applications, you can disable the cross-encoder:

retriever = HybridRetriever(use_cross_encoder=False)
# Saves ~100ms, loses ~5% accuracy

When to Use HybridRAG

Good fit:

Technical documentation (mixed keywords + concepts)
Code knowledge bases (exact identifiers + semantic queries)
Decision records (ADR references + natural language)
Incident histories (error codes + descriptions)

Overkill for:

Pure keyword search (use BM25 alone)
Pure semantic search (use vectors alone)
Real-time autocomplete (latency-sensitive)

Usage Example

from src.memory.retrieval import create_hybrid_retriever

# Initialize
retriever = create_hybrid_retriever(
    use_cross_encoder=True,
    bm25_weight=1.0,
    vector_weight=1.0,
)

# Index your memories
retriever.index_memories("user-1", memories)

# Retrieve
results, metrics = await retriever.retrieve(
    query="How do I configure Redis caching?",
    user_id="user-1",
    top_k=10,
)

# Use results
for result in results:
    print(f"{result.memory_id}: {result.score:.3f}")
    print(f"  Content: {result.memory.content[:100]}...")

# Monitor performance
print(f"Latency: {metrics.total_time_ms:.0f}ms")
print(f"  Stage 1: {metrics.stage1_time_ms:.0f}ms")
print(f"  Stage 2: {metrics.stage2_time_ms:.0f}ms")

The Results

In our benchmarks, HybridRAG outperforms pure vector search by 50%+ on multi-hop queries while maintaining sub-second latency:

Method	Recall@10	Latency (p95)
BM25 only	0.62	50ms
Vector only	0.71	250ms
HybridRAG	0.89	450ms

Benchmarks: 10K document corpus of technical documentation. Tested on 4-core CPU (no GPU). Embedding model: all-MiniLM-L6-v2. Cross-encoder: ms-marco-MiniLM-L-6-v2. Query set: 500 natural language questions with human-labeled relevance judgments.

The two-stage architecture gives us the best of both worlds: comprehensive candidate generation followed by precise relevance ranking.

Recommended Libraries

Component	Library	Notes
BM25	`rank_bm25` or Elasticsearch	Pure Python or production-grade
Embeddings	`sentence-transformers`	Wide model selection
Vector Index	`faiss` or `hnswlib`	Fast ANN search
Cross-Encoder	`sentence-transformers`	Same library, different models
Orchestration	`langchain` or `haystack`	Optional, adds complexity

Minimal pip install:

pip install sentence-transformers rank-bm25 numpy

Limitations

HybridRAG isn't a silver bullet. Know when it fails:

Very short queries (1-2 words): BM25 dominates; vector adds noise
Highly technical identifiers: UUIDs, hashes, and base64 strings need exact match, not semantic search
Multilingual queries: Embedding model must support target languages
Real-time autocomplete: 450ms is too slow; use BM25 prefix matching instead
Adversarial queries: Prompt injection in documents can poison retrieval

Tuning tips:

Start with k=50 candidates from each source; adjust based on recall/latency tradeoff
If queries are keyword-heavy (code, configs), boost bm25_weight
If queries are conceptual (natural language), boost vector_weight
The reranker is optional—disable it if latency matters more than precision

HybridRAG is part of ADR-003 Phase 3. See the full ADR for implementation details and performance benchmarks.

The Architecture​

Stage 1: Parallel Candidate Generation​

BM25: The Keyword Hunter​

Vector Search: The Semantic Finder​

Running in Parallel​

Stage 2: Fusion + Reranking​

The Naive Approach (Don't Do This)​

Reciprocal Rank Fusion (RRF)​

Cross-Encoder Reranking​

The Complete Flow​

Tuning the Weights​

Provenance Tracking​

Performance Characteristics​

When to Use HybridRAG​

Usage Example​

The Results​

Recommended Libraries​

Limitations​