
ADR-003: Project Intent - Persistent Technical Context for AI Development

Status: Accepted
Date: 2025-12-22
Decision Makers: Development Team
Owners: @christopherjoseph
Version: 6.7 (MaaS Trust Model)

Decision Summary

We adopt a Memory-First architecture using MCP, implementing a phased approach that prioritizes foundations and quick wins before architectural complexity:

  1. Phase 0: Evaluation, governance, and observability foundations
  2. Phase 1: Mem0 for immediate personalization and conversational memory
  3. Phase 2: Context Engineering for optimized retrieval
  4. Phase 3: HybridRAG with Knowledge Graph for complex queries
  5. Phase 4: Advanced features (Hindsight, MaaS) as needed

We reject simple Vector-only RAG as insufficient for code logic and dependency reasoning.


Critical Architecture Decision: Build vs. Integrate (Mem0)

The Question

Should we integrate Mem0 for conversational memory, or build equivalent capabilities on Pixeltable?

| Capability | Pixeltable | Mem0 |
|---|---|---|
| Vector embeddings | ✓ | ✓ |
| Semantic search | ✓ | ✓ |
| Structured metadata | ✓ | ✓ |
| Graph relationships | Partial | ✓ (Mem0g) |
| User/session scoping | Manual | Built-in |
| Memory extraction | Manual | Automatic |

Council Decision: Build on Pixeltable (Option B)

All 4 models unanimously agreed: Pixeltable must remain the canonical source of truth.

The "Split Brain" Problem

Integrating Mem0 creates two vector databases:

User Query → MCP Server → ???
  ├→ Pixeltable (code, ADRs, incidents)
  └→ Mem0 (conversations, preferences)
       └→ Its own vector DB

Problems this creates:

  • Two sources of truth with different consistency models
  • Cross-system queries for "What did we decide about auth?" are complex
  • Different score scales make result merging difficult
  • Debugging spans two systems with different logging

Why Pixeltable Can Replace Mem0

Mem0's primary value (extraction + scoping) can be replicated using Pixeltable's computed columns:

# Memory extraction via LLM UDF (computed column)
@pt.computed_column
def extract_memory_facts(row) -> list[dict]:
    """Extract memorable facts from conversation."""
    prompt = """
    Analyze this conversation and extract:
    1. User preferences (explicit and implicit)
    2. Decisions made
    3. Facts learned about the codebase
    4. Corrections to previous understanding
    Return as structured JSON with confidence scores.
    """
    return llm_extract(prompt, row.content)

# User/session scoping
@pt.computed_column
def memory_scope(row) -> dict:
    return {
        "user_id": row.user_id,
        "project_id": row.project_id,
        "scope_hierarchy": [
            f"user:{row.user_id}",
            f"project:{row.project_id}",
            "global",
        ],
    }

# Memory consolidation (dedup, merge, supersede)
class MemoryConsolidator:
    async def consolidate(self, new_memory: Memory) -> ConsolidationResult:
        similar = await self.find_similar(new_memory, threshold=0.85)
        if not similar:
            return ConsolidationResult(action="insert", memory=new_memory)
        # Handle contradictions, merges, supersedes...
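The elided find_similar step can be sketched with plain cosine similarity over stored embeddings (a hedged illustration: function and field names are ours, and production code would use Pixeltable's vector index rather than a linear scan):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def find_similar(new_vec: list[float], existing: list[dict],
                 threshold: float = 0.85) -> list[dict]:
    """Return stored memories whose embeddings exceed the similarity threshold."""
    return [m for m in existing
            if cosine_similarity(new_vec, m["embedding"]) >= threshold]
```

Anything above the 0.85 threshold is a consolidation candidate; anything below is inserted as a new memory.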

Decision Matrix

| Criteria | A) Integrate Mem0 | B) Build on Pixeltable |
|---|---|---|
| Time to MVP | 2-3 weeks | 4-6 weeks |
| Maintenance burden | High (two systems) | Low (one system) |
| Feature velocity | Fast initially, slows | Slower initially, compounds |
| Architectural simplicity | Poor | Excellent |
| Long-term flexibility | Limited by Mem0's roadmap | Full control |
| Debugging | High complexity | Low complexity |
| Data consistency | Two sources of truth | Single source |

Accepted Trade-offs

  • Longer initial development (6 weeks vs 2-3 weeks)
  • Must build extraction/consolidation logic ourselves
  • No access to Mem0's ongoing R&D

Validation Criteria (Phase 1 Exit)

  • Memory retrieval latency < 200ms p95
  • Extraction accuracy > 80% (manual evaluation)
  • Zero cross-user memory leakage

Context

The Problem: Ephemeral AI Context

Large Language Models (LLMs) suffer from a fundamental limitation: context window amnesia. Each conversation starts fresh, forcing developers to repeatedly explain:

  • Project architecture and design decisions
  • Coding conventions and patterns
  • Past incidents and their resolutions
  • Domain-specific terminology
  • Team decisions and their rationale

This creates three significant pain points:

  1. Repetitive Context Loading: Developers waste time re-establishing context every session
  2. Lost Institutional Knowledge: Valuable decisions and learnings evaporate between conversations
  3. Inconsistent Assistance: Without historical context, AI suggestions may contradict past decisions

The Vision: Persistent Technical Memory

Luminescent Cluster aims to give AI assistants persistent technical memory - the ability to recall project context, architectural decisions, incident history, and codebase knowledge across sessions and even across different LLM providers.

Industry Context (December 2025)

The LLM memory landscape has evolved rapidly. Key developments:

  • MCP Standardization: The Model Context Protocol is now the de-facto standard, adopted by OpenAI, Google DeepMind, Microsoft, and AWS. In December 2025, Anthropic donated MCP to the Linux Foundation's Agentic AI Foundation (AAIF).
  • Beyond RAG: Traditional RAG is being challenged by agentic memory architectures that maintain context over time, track evolving beliefs, and perform temporal reasoning.
  • Production Scale: Systems like Mem0 have achieved 26% accuracy improvements with 91% lower latency and 90% token savings at enterprise scale (186M API calls/Q3 2025).
  • Context Engineering: A new discipline focused on "the delicate art and science of filling the context window with just the right information" (Andrej Karpathy).

Decision Drivers (Ranked)

| Priority | Driver | Weight | Rationale |
|---|---|---|---|
| 1 | Developer Experience | 30% | Reduce repetition friction; faster onboarding |
| 2 | Accuracy | 25% | Retrieved context must be relevant and correct |
| 3 | Token Efficiency | 20% | Context window is expensive; minimize waste |
| 4 | Implementation Velocity | 15% | Need wins this quarter to validate approach |
| 5 | Future Flexibility | 10% | Don't lock in prematurely; enable pivots |

Non-Goals

This ADR explicitly does not address:

  • Multi-tenant implementation details: Multi-tenancy is technically supported via the Extension Registry (ADR-005) and unified in the integration layer (ADR-007). This ADR remains focused strictly on memory architecture; tenant isolation is treated here as an external constraint rather than a core feature
  • General documentation RAG: Focus is on technical context, not help docs
  • AGI-style continuous learning: Memory is curated, not autonomous
  • PII/secrets storage: Sensitive data excluded by policy
  • Training custom models: Memory augments existing LLMs, doesn't train new ones

Success Metrics (Quantified)

| Metric | Baseline | Target | Measurement |
|---|---|---|---|
| Context re-explanation frequency | ~5/session | <1/session | User survey |
| Context retrieval latency | N/A | <500ms p95 | Instrumentation |
| Retrieval precision@5 | N/A | >85% | LongMemEval subset |
| Token efficiency | N/A | <30% of window for memory | Context analysis |
| Multi-hop query accuracy | N/A | >75% | Internal benchmark |
| Developer satisfaction (NPS) | N/A | >50 | Quarterly survey |

Decision

We implement a three-tier memory architecture exposed via the Model Context Protocol (MCP):

Tier 1: Session Memory (Hot Context)

Purpose: Fast access to current development state
Implementation: session_memory_server.py
Data Sources:

  • Git repository state (current branch, status)
  • Recent commits (last 200)
  • Current diff (staged/unstaged changes)
  • File change history
  • Active task context (set by user/agent)

Characteristics:

  • Latency: <10ms (in-memory)
  • Scope: Current repository
  • Persistence: Ephemeral (session-bound)
  • Cost: Zero (local computation)

Tier 2: Long-term Memory (Persistent Knowledge)

Purpose: Semantic search over organizational knowledge
Implementation: pixeltable_mcp_server.py + pixeltable_setup.py
Data Sources:

  • Code repositories (multi-service)
  • Architectural Decision Records (ADRs)
  • Production incident history
  • Meeting transcripts
  • Documentation

Characteristics:

  • Latency: 100-500ms (semantic search)
  • Scope: Entire organization, cross-project
  • Persistence: Durable (survives restarts)
  • Cost: Embedding generation (local sentence-transformers by default)

Tier 3: Intelligent Orchestration

Purpose: Efficient multi-tool coordination
Mechanisms:

  • Tool Search: On-demand tool discovery (85% token reduction)
  • Programmatic Tool Calling: Batch operations in sandbox (37% token reduction)
  • Deferred Loading: Heavy tools loaded only when needed

Architecture

┌─────────────────────────────────────────────────────────────────┐
│ AI Assistant (Claude, etc.) │
├─────────────────────────────────────────────────────────────────┤
│ Model Context Protocol (MCP) │
├────────────────┬────────────────────────┬───────────────────────┤
│ Tier 1 │ Tier 2 │ Tier 3 │
│ Session Memory │ Long-term Memory │ Orchestration │
├────────────────┼────────────────────────┼───────────────────────┤
│ • Git state │ • Pixeltable DB │ • Tool Search │
│ • Recent commits│ • Semantic embeddings │ • Programmatic Calls │
│ • Current diff │ • Multi-project index │ • Deferred Loading │
│ • Task context │ • ADRs, incidents │ │
└────────────────┴────────────────────────┴───────────────────────┘

Interface Contract (MCP)

Implementation Note: These interface definitions represent the conceptual architectural vision. For live, versioned protocol signatures (including ContextStore, TenantProvider, and AuditLogger), refer to src/extensions/protocols.py as consolidated in ADR-007. Chatbot-specific implementations of these interfaces are provided by the adapters defined in ADR-006.

The memory system exposes the following MCP resources and tools:

Resources (Read-Only Context)

project://memory/recent_decisions    # Last N architectural decisions
project://memory/active_incidents    # Open incidents affecting this service
project://memory/conventions         # Coding patterns and team standards
project://memory/dependency_graph    # Service relationships (Phase 3+)

Tools (Actions)

# Session Memory Tools
set_task_context(task: str, details: dict)    # Set current work context
get_task_context() -> TaskContext             # Retrieve current context
search_commits(query: str) -> list[Commit]    # Search commit history

# Long-term Memory Tools
semantic_search(query: str, limit: int) -> list[Result]
ingest_code(path: str, service: str)          # Index codebase
ingest_adr(path: str, service: str)           # Index decision record
ingest_incident(summary: str, severity: str, lessons: str)

# Memory Management (Phase 0+)
update_memory(key: str, value: Any, source: str)   # With provenance
invalidate_memory(key: str, reason: str)           # Explicit expiration
get_memory_provenance(key: str) -> Provenance      # Audit trail

Key Design Principles

1. LLM-Agnostic via MCP

The system uses the Model Context Protocol (MCP), making it portable across:

  • Claude (via Claude Code)
  • OpenAI ChatGPT (MCP support added March 2025)
  • Google Gemini (MCP support announced April 2025)
  • Custom agents via programmatic access

2. Semantic Search over Keyword Matching

Long-term memory uses sentence-transformer embeddings for semantic similarity:

# Find conceptually related code, not just keyword matches
"authentication flow" → finds OAuth, JWT, session handling code

3. Multi-Project Awareness

Knowledge is indexed by service_name, enabling:

  • Cross-project searches ("How does auth-service handle tokens?")
  • Project-specific filtering ("Show incidents for payment-api only")
  • Organizational patterns ("What database patterns do we use?")

4. Automatic Embedding Maintenance

Pixeltable's computed columns automatically recompute embeddings when content changes:

# Embeddings stay in sync - no manual refresh needed
embedding=kb.content.apply(embed_text)
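The auto-updating behavior can be illustrated with a toy Python property (not Pixeltable's actual API): reassigning content recomputes the derived embedding, so the two can never drift apart:

```python
class KnowledgeRow:
    """Toy model of computed-column semantics: the embedding is
    re-derived whenever content changes (illustrative only)."""
    def __init__(self, content: str, embed_fn):
        self._embed_fn = embed_fn
        self.content = content  # setter below computes the embedding

    @property
    def content(self) -> str:
        return self._content

    @content.setter
    def content(self, value: str) -> None:
        self._content = value
        self.embedding = self._embed_fn(value)  # stays in sync, no manual refresh
```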

5. Defense in Depth (Python Version)

Per ADR-001, the system includes 7 layers of protection against Python version mismatch issues that could corrupt the Pixeltable database.

Use Cases

1. Architectural Continuity

User: "Why did we choose PostgreSQL over MongoDB?"
AI: [Queries ADRs] "ADR-005 documents this decision from March 2024..."

2. Incident-Aware Development

User: "I'm adding rate limiting to the auth service"
AI: [Queries incidents] "Note: We had an outage in November due to rate limiter
misconfiguration. The post-mortem recommended..."

3. Cross-Session Context

Session 1: User sets task context "Implementing OAuth2 PKCE flow"
Session 2: AI recalls task context and relevant ADRs automatically

4. Codebase Navigation

User: "How do we handle database connections?"
AI: [Semantic search] "Based on auth-service/db/pool.py and the connection
pooling ADR, you use..."

Rationale

Why Not Just Use RAG?

Traditional RAG (Retrieval Augmented Generation) typically:

  • Requires manual chunk management
  • Needs explicit embedding refresh
  • Lacks structured metadata (service, type, severity)
  • Doesn't integrate with development workflow

Luminescent Cluster adds:

  • Computed columns: Auto-updating embeddings
  • Typed knowledge: Code vs ADR vs incident with different schemas
  • Git integration: Session memory tied to repository state
  • MCP exposure: Native tool integration with AI assistants

Why MCP over Custom APIs?

  • Standardized: Works with any MCP-compatible client
  • Discoverable: Tools are self-documenting
  • Composable: Clients can orchestrate multiple MCP servers
  • Future-proof: Now backed by Linux Foundation with industry-wide adoption

Why Pixeltable?

  • Computed columns: Embeddings auto-update on content change
  • Multimodal ready: Can extend to images, videos
  • Snapshot/restore: Built-in versioning for knowledge base
  • Python-native: Fits development workflow

Improvement Options (December 2025 Research)

Based on current industry developments, the following enhancement options are presented for council review:

Option A: HybridRAG - Knowledge Graph Integration

What: Combine vector embeddings with a knowledge graph for multi-hop reasoning.

Industry Evidence:

  • Microsoft Research shows 2.8x accuracy improvement with hybrid approaches
  • GraphRAG (Microsoft) constructs knowledge graphs from unstructured data
  • Cedars-Sinai's AlzKB demonstrates real-world HybridRAG success

Implementation:

┌─────────────────────────────────────────────────────────────────┐
│ HybridRAG Architecture │
├────────────────────────┬────────────────────────────────────────┤
│ Vector Search │ Graph Traversal │
│ (Pixeltable) │ (Neo4j/Memgraph) │
├────────────────────────┼────────────────────────────────────────┤
│ • Semantic similarity │ • Entity relationships │
│ • Fuzzy matching │ • Multi-hop reasoning │
│ • Fast retrieval │ • Causal chains │
└────────────────────────┴────────────────────────────────────────┘


Results from the two retrieval paths are merged via Reciprocal Rank Fusion, then passed through a neural reranker.

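Reciprocal Rank Fusion itself is simple to sketch: each document scores the sum of 1/(k + rank) across the ranked lists it appears in, with k = 60 as the conventional smoothing constant:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists (e.g. vector and graph retrieval).
    A document's score is sum(1 / (k + rank)) over the lists containing it."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents appearing high in both lists dominate the fused ranking, which sidesteps the incompatible score scales of the two retrieval paths.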
Benefits:

  • Multi-hop reasoning: "What services depend on the auth module that had the incident?"
  • Explicit relationships: Code → ADR → Incident → Resolution chains
  • Better temporal reasoning: Track how decisions evolved

Tradeoffs:

  • Additional infrastructure (graph database)
  • Data modeling complexity
  • Sync between vector and graph stores

Effort: Medium-High
Impact: High for complex organizational queries


Option B: Mem0 Integration

What: Integrate Mem0's production-proven memory layer alongside Pixeltable.

Industry Evidence:

  • 26% accuracy improvement, 91% lower p95 latency, 90% token savings
  • 41K GitHub stars, 14M downloads, 186M API calls/quarter
  • AWS chose Mem0 as exclusive memory provider for Agent SDK
  • SOC 2 & HIPAA compliant

Architecture:

┌─────────────────────────────────────────────────────────────────┐
│ Memory Layer │
├────────────────────────┬────────────────────────────────────────┤
│ Pixeltable │ Mem0 │
│ (Long-term KB) │ (Conversational Memory) │
├────────────────────────┼────────────────────────────────────────┤
│ • Codebase index │ • User preferences │
│ • ADRs & incidents │ • Conversation facts │
│ • Static knowledge │ • Dynamic learning │
│ • Service-scoped │ • User/session-scoped │
└────────────────────────┴────────────────────────────────────────┘

Benefits:

  • Production-proven scale and performance
  • Graph-based memory variant (Mem0g) for relational knowledge
  • Three-line integration
  • Enterprise compliance built-in

Tradeoffs:

  • External dependency (cloud or self-hosted)
  • Potential overlap with Pixeltable functionality
  • Additional cost for hosted version

Effort: Low-Medium
Impact: High for personalization and learning


Option C: Hindsight Agentic Memory

What: Adopt the four-network memory architecture that achieved 91.4% on LongMemEval.

Industry Evidence:

  • Highest accuracy on LongMemEval benchmark (December 2025)
  • Designed specifically for agents needing temporal/causal reasoning
  • TEMPR retrieval: semantic + keyword + graph + temporal filtering

Four Network Architecture:

┌──────────────────────────────────────────────────────────────────┐
│ Hindsight Memory Networks │
├─────────────────┬─────────────────┬─────────────────┬────────────┤
│ World │ Bank │ Opinion │ Observation│
│ Network │ Network │ Network │ Network │
├─────────────────┼─────────────────┼─────────────────┼────────────┤
│ External facts │ Agent's own │ Subjective │ Neutral │
│ about the world │ experiences & │ judgments with │ entity │
│ │ actions │ confidence │ summaries │
└─────────────────┴─────────────────┴─────────────────┴────────────┘

Mapped to Luminescent Cluster:

  • World Network: Codebase structure, API contracts, dependencies
  • Bank Network: What the AI has done, commits made, PRs reviewed
  • Opinion Network: Code quality assessments, architecture preferences
  • Observation Network: Service summaries, team conventions

Benefits:

  • Separates facts from opinions (prevents hallucination bleed)
  • Tracks agent's own actions (audit trail)
  • Temporal filtering for time-sensitive queries
  • Open source (Apache 2.0)

Tradeoffs:

  • Significant architectural change
  • New data model required
  • Less mature than Mem0 (newer project)

Effort: High
Impact: Very High for long-horizon agent tasks


Option D: Context Engineering Enhancements

What: Implement advanced context management strategies.

Industry Evidence:

  • ACE framework: +10.6% on agents, +8.6% on finance benchmarks
  • Google ADK's Context Compaction reduces latency/tokens
  • Anthropic found isolated contexts outperform single-agent approaches

Strategies:

  1. Memory Blocks: Structure context into discrete functional units
┌─────────────────────────────────────────────────────────────────┐
│ Memory Block Layout │
├─────────────────────────────────────────────────────────────────┤
│ [System Block] │ Core instructions, persona │
│ [Project Block] │ Current project context, conventions │
│ [Task Block] │ Active task, goals, constraints │
│ [History Block] │ Compressed conversation history │
│ [Knowledge Block] │ Retrieved ADRs, incidents, code │
└─────────────────────────────────────────────────────────────────┘
  2. Context Compaction: Auto-summarize when threshold reached
  3. Selective Retrieval: Pull only relevant knowledge per query
  4. Context Isolation: Split complex tasks across subagents
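A minimal sketch of block-based assembly with compaction (character counts stand in for tokens; block names follow the layout above, and the truncate-history policy is our assumption in place of real summarization):

```python
BLOCK_ORDER = ["system", "project", "task", "history", "knowledge"]

def assemble_context(blocks: dict[str, str], budget_chars: int) -> str:
    """Assemble the context window from discrete blocks in priority order,
    compacting the history block first when over budget."""
    out = {k: blocks.get(k, "") for k in BLOCK_ORDER}
    total = sum(len(v) for v in out.values())
    if total > budget_chars:
        overflow = total - budget_chars
        out["history"] = out["history"][:max(0, len(out["history"]) - overflow)]
    return "\n\n".join(f"[{k.title()} Block]\n{v}" for k, v in out.items() if v)
```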

Benefits:

  • Reduces "context rot" from oversized windows
  • Clear separation of concerns
  • Enables monitoring and debugging
  • Works with existing infrastructure

Tradeoffs:

  • Requires client-side coordination
  • MCP servers may not have visibility into full context
  • Summarization can lose nuance

Effort: Medium
Impact: Medium-High for efficiency


Option E: Memory as a Service (MaaS)

What: Shift from agent-bound memory to shared memory services for multi-agent collaboration.

Industry Evidence:

  • Research shows agent memory silos hinder collaboration
  • MaaS paradigm emerging for multi-agent systems
  • Enables organizational memory shared across tools/agents

Architecture:

┌─────────────────────────────────────────────────────────────────┐
│ Memory as a Service (MaaS) │
├─────────────────────────────────────────────────────────────────┤
│ Shared Memory Layer │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ Code KB │ │ Decision │ │ Incident │ │
│ │ Service │ │ Service │ │ Service │ │
│ └─────┬─────┘ └─────┬─────┘ └─────┬─────┘ │
│ │ │ │ │
├─────────┼──────────────┼──────────────┼──────────────────────────┤
│ ▼ ▼ ▼ │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ Claude │ │ GPT Agent │ │ Custom │ │
│ │ Code │ │ │ │ Pipeline │ │
│ └───────────┘ └───────────┘ └───────────┘ │
└─────────────────────────────────────────────────────────────────┘

Benefits:

  • Memory persists regardless of which agent/LLM is used
  • Enables handoff between specialized agents
  • Organizational knowledge accessible to all tools
  • Natural fit for MCP's server architecture

Tradeoffs:

  • Access control complexity
  • Consistency across agents
  • Security considerations (MEXTRA attack vulnerabilities)

Effort: Medium
Impact: High for multi-agent workflows


Option F: LangMem Integration

What: Integrate LangChain's LangMem SDK for procedural, episodic, and semantic memory types.

Industry Evidence:

  • Native integration with LangGraph (popular agent framework)
  • DeepLearning.AI course validates approach
  • MongoDB, Redis integrations available
  • Part of LangChain's production ecosystem

Memory Types:

┌─────────────────────────────────────────────────────────────────┐
│ LangMem Memory Types │
├─────────────────┬─────────────────┬─────────────────────────────┤
│ Procedural │ Episodic │ Semantic │
├─────────────────┼─────────────────┼─────────────────────────────┤
│ How to do tasks │ Specific events │ General facts │
│ (coding styles, │ (this PR, that │ (architecture, │
│ conventions) │ incident) │ team knowledge) │
└─────────────────┴─────────────────┴─────────────────────────────┘

Benefits:

  • Well-documented, production-ready
  • Multiple persistence backends
  • Checkpointing and time-travel
  • Large community support

Tradeoffs:

  • LangChain dependency
  • May not integrate directly with MCP
  • Overlaps with Pixeltable functionality

Effort: Medium
Impact: Medium for LangGraph users


Options Comparison Matrix

Weighted Decision Matrix

| Criterion (Weight) | A: HybridRAG | B: Mem0 | C: Hindsight | D: Context Eng | E: MaaS | F: LangMem |
|---|---|---|---|---|---|---|
| Accuracy (25%) | ★★★★★ | ★★★☆☆ | ★★★★★ | ★★☆☆☆ | ★★★☆☆ | ★★★☆☆ |
| Dev Experience (30%) | ★★★☆☆ | ★★★★★ | ★★★☆☆ | ★★★★☆ | ★★★★☆ | ★★★★☆ |
| Effort/Risk (20%) | ★☆☆☆☆ | ★★★★☆ | ★☆☆☆☆ | ★★★★★ | ★★★☆☆ | ★★★★☆ |
| Maturity (15%) | ★★★☆☆ | ★★★★☆ | ★★☆☆☆ | ★★★★★ | ★★★☆☆ | ★★★★☆ |
| Flexibility (10%) | ★★★★☆ | ★★★☆☆ | ★★★☆☆ | ★★★★★ | ★★★★☆ | ★★★★☆ |
| Weighted Score | 2.95 | 3.70 | 2.80 | 3.80 | 3.35 | 3.60 |

Quantitative Metrics (Hypotheses)

| Option | Accuracy Gain | Latency Impact | Effort | Infrastructure | Best For |
|---|---|---|---|---|---|
| A. HybridRAG | +180% (2.8x)* | +50ms | High | Graph DB | Multi-hop code queries |
| B. Mem0 | +26% | -91% p95 | Low-Med | Cloud/Self-host | User personalization |
| C. Hindsight | +40%** | Neutral | High | New data model | Long-horizon agents |
| D. Context Eng | +10-30% | -30% | Medium | None | Immediate optimization |
| E. MaaS | N/A | Neutral | Medium | API layer | Multi-agent workflows |
| F. LangMem | Moderate | Neutral | Medium | LangChain | LangGraph integration |

*Based on Microsoft GraphRAG research (2024); requires validation on code domains
**Estimated from LongMemEval 91.4% vs ~65% baseline RAG


Additional Industry Developments (Council Additions)

Based on council feedback, the following December 2025 developments were identified as missing:

Option G: Context Caching (Provider-Side)

What: Leverage provider-side context caching (Anthropic, OpenAI, Google) to avoid re-sending stable project context.

Industry Evidence:

  • By late 2025, all major providers offer context caching
  • Cost reduction: 75-90% for repeated project context
  • Latency reduction: Eliminate round-trip for cached prefixes

Implementation:

┌─────────────────────────────────────────────────────────────────┐
│ Context Caching Flow │
├─────────────────────────────────────────────────────────────────┤
│ [Stable Context] │ [Dynamic Context] │
│ • Project architecture │ • Current task │
│ • ADRs & conventions │ • Recent code changes │
│ • Team patterns │ • User query │
│ ↓ │ ↓ │
│ CACHED (TTL: 1 hour) │ SENT FRESH │
└─────────────────────────────────────────────────────────────────┘
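Client-side, this split can be sketched as follows (provider cache APIs differ, so this only shows structuring the prompt into a hashable stable prefix plus a fresh suffix; all names are illustrative):

```python
import hashlib

# Stable project context: cached provider-side, re-sent only when it changes
STABLE_CONTEXT = [
    "## Project architecture\n...",
    "## ADRs & conventions\n...",
    "## Team patterns\n...",
]

def build_prompt(user_query: str, recent_changes: str) -> tuple[str, str, str]:
    """Split the prompt into a cacheable stable prefix and a fresh dynamic
    suffix; the prefix hash tells the client when the cached copy is stale."""
    prefix = "\n\n".join(STABLE_CONTEXT)
    cache_key = hashlib.sha256(prefix.encode()).hexdigest()
    dynamic = f"{recent_changes}\n\nUser: {user_query}"
    return prefix, dynamic, cache_key
```

As long as the stable blocks are byte-identical across requests, the provider can serve the prefix from cache and only the dynamic suffix is processed fresh.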

Effort: Low
Impact: High for cost/latency


Option H: Episodic Rollback ("Git for Memory")

What: Version control for memory state, enabling rollback when agents go wrong.

Industry Evidence:

  • Developer expectation from git workflows
  • Critical for debugging agent mistakes
  • Enables "what if" exploration

Implementation:

# Memory checkpoint API
checkpoint = memory.create_checkpoint("before-refactor")
# ... agent makes decisions ...
memory.rollback_to(checkpoint) # Undo bad decisions
memory.diff(checkpoint, "HEAD") # Compare states
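A minimal in-memory version of this checkpoint API might look like the following (illustrative only; diff here compares a checkpoint against the current state rather than taking two arguments):

```python
import copy

class VersionedMemory:
    """Minimal 'git for memory' sketch: snapshot state, roll back, diff."""
    def __init__(self):
        self.state: dict = {}
        self._checkpoints: dict[str, dict] = {}

    def create_checkpoint(self, name: str) -> str:
        self._checkpoints[name] = copy.deepcopy(self.state)
        return name

    def rollback_to(self, name: str) -> None:
        self.state = copy.deepcopy(self._checkpoints[name])

    def diff(self, name: str) -> dict:
        """Keys added, removed, or changed since the named checkpoint."""
        old = self._checkpoints[name]
        return {
            "added": sorted(self.state.keys() - old.keys()),
            "removed": sorted(old.keys() - self.state.keys()),
            "changed": sorted(k for k in self.state.keys() & old.keys()
                              if self.state[k] != old[k]),
        }
```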

Effort: Medium
Impact: High for agent reliability


Option I: Provenance & Governance

What: Every memory item carries source links, timestamps, confidence, and validity scope.

Industry Evidence:

  • Enterprise requirement for audit trails
  • Prevents hallucination from unverified sources
  • Enables "why does the AI think this?" debugging

Schema:

@dataclass
class MemoryItem:
    content: str
    source: str            # "PR #402", "ADR-005", "user:chris"
    created_at: datetime
    expires_at: datetime   # TTL-based expiration
    confidence: float      # 0.0-1.0
    verified_by: str       # Human approval if required
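Retrieval can then filter on provenance fields; a sketch (restating the schema as a runnable dataclass, with 0.5 as an assumed confidence floor):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class MemoryItem:
    content: str
    source: str            # "PR #402", "ADR-005", "user:chris"
    created_at: datetime
    expires_at: datetime   # TTL-based expiration
    confidence: float      # 0.0-1.0
    verified_by: str       # Human approval if required

def retrievable(item: MemoryItem, now: datetime,
                min_confidence: float = 0.5) -> bool:
    """Serve a memory only if it has not expired and meets the confidence
    floor; everything else is excluded from retrieval."""
    return item.expires_at > now and item.confidence >= min_confidence
```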

Effort: Medium
Impact: High for enterprise trust


Option J: Retrieval-Augmented Thoughts (RAT)

What: Interleave retrieval during chain-of-thought, not just before generation.

Industry Evidence:

  • Research shows improved reasoning with mid-thought retrieval
  • Addresses "lost in the middle" problem
  • More accurate for complex technical queries

Flow:

Traditional RAG:  [Retrieve] → [Think] → [Respond]
RAT:              [Think] → [Retrieve] → [Think more] → [Retrieve] → [Respond]

Effort: High (requires reasoning model integration)
Impact: High for complex queries


Option K: Temporal Memory Decay (Forgetting Curve)

What: Implement forgetting mechanisms so old/unused memories decay in relevance.

Industry Evidence:

  • "Relevance pollution" is a real problem at scale
  • Old decisions may conflict with new ones
  • Memory systems need pruning strategies

Implementation:

import math

# Decay function: relevance falls off exponentially with days since last access.
# Re-access refreshes relevance; low-relevance items are deprioritized in retrieval.
def decayed_relevance(base_relevance: float, days_since_access: float,
                      decay_rate: float = 0.01) -> float:  # decay_rate (λ) is tunable
    return base_relevance * math.exp(-decay_rate * days_since_access)

Effort: Low-Medium
Impact: Medium for long-term maintenance

Critical Change: The council unanimously agreed the original roadmap was "Optimization before Foundation." The revised roadmap prioritizes quick wins and foundations before heavy architectural lifts.

Phase 0: Foundations (Required First)

Purpose: Cannot optimize what you cannot measure

Deliverables:

  1. Evaluation Harness

    • Fixed task set + automated scoring
    • Retrieval quality metrics (precision@k, citation correctness)
    • Latency/cost instrumentation
    • Contradiction/hallucination tests
  2. HNSW Recall Health Monitoring (Council Addition - January 2026)

    Critical Risk: HNSW approximate search silently degrades as the database grows. No errors are raised—the system appears healthy while retrieval quality deteriorates.

    | Requirement | Target | Notes |
    |---|---|---|
    | Recall@k Metric | Measured against brute-force exact search | Golden query set (50 queries) |
    | Absolute Threshold | Recall@10 ≥ 0.90 | Alert if below |
    | Relative Drift | ≤ 5% drop from baseline | Alert on regression |
    | Filtered Search | Evaluate with tenant/tag filters | Prevents "filter-induced recall collapse" |
    | Reindex Trigger | Auto-VACUUM when Recall < threshold | Automated maintenance |

    Retuning Milestones:

    • 10k items: Benchmark and log
    • 50k items: Benchmark, alert, consider ef_search tuning
    • 100k items: Mandatory review, consider index rebuild

    Embedding Versioning (Council Required):

    • Version-tag all embeddings with model identifier
    • On model change: flag for re-embedding
    • Index_v1 cannot serve traffic for Model_v2
  3. Memory Schema & Lifecycle

    • Define memory types (decisions, facts, procedures, preferences)
    • TTL and expiration policies
    • Versioning strategy (Option H foundation)
    • "Source of truth" policy
  4. Governance & Observability

    • Trace: retrieval → context assembly → model output
    • Log which memories were used and why
    • Audit trail for memory changes
    • Access control framework
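The Recall@k health check from deliverable 2 can be sketched as set overlap between the exact (brute-force) top-k and the ANN index's top-k for each golden query (function names are illustrative):

```python
def recall_at_k(exact_top_k: list[str], ann_top_k: list[str], k: int = 10) -> float:
    """Fraction of the brute-force top-k that the approximate (HNSW)
    index also returned — 1.0 means no silent degradation."""
    exact = set(exact_top_k[:k])
    return len(exact & set(ann_top_k[:k])) / len(exact)

def recall_health(golden: dict[str, tuple[list[str], list[str]]],
                  threshold: float = 0.90) -> list[str]:
    """Return the golden-set queries whose Recall@10 fell below threshold."""
    return [q for q, (exact, ann) in golden.items()
            if recall_at_k(exact, ann) < threshold]
```

Running this over the 50-query golden set on a schedule makes the otherwise-silent HNSW degradation visible and alertable.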

Exit Criteria: Baseline metrics established, governance policies documented


Phase 1: Conversational Memory (Pixeltable Native) - 8 WEEKS

Decision: Build on Pixeltable (not Mem0 integration - see Architecture Decision above)

Purpose: Unified memory store with extraction capabilities

Critical Architecture: The "Janitor" Pattern (Council Mandated)

Running LLM extraction synchronously kills latency. We adopt a tiered approach:

┌─────────────────────────────────────────────────────────────────────────────┐
│ HOT / WARM / COLD MEMORY ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ TIER 1: HOT MEMORY (Real-time) │
│ ───────────────────────────── │
│ • Raw chat history retrieval │
│ • No extraction cost │
│ • Latency: <50ms │
│ │
│ TIER 2: WARM MEMORY (Async Extraction) │
│ ────────────────────────────────────── │
│ • Pixeltable computed columns with SMALL model (Llama-3-8B, Haiku) │
│ • Extraction runs AFTER response sent to user │
│ • Latency: Background (no user impact) │
│ │
│ TIER 3: COLD MEMORY (Scheduled Consolidation) │
│ ───────────────────────────────────────────── │
│ • "Janitor Process" - nightly batch job │
│ • Uses REASONING model (GPT-4o, Opus) for complex dedup │
│ • Merges facts, resolves contradictions, expires old data │
│ • Latency: Nightly (no user impact) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
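The Hot/Warm split can be sketched with asyncio: the reply returns first, and extraction is scheduled as a background task so it adds no user-visible latency (a simplified stand-in; the real extraction is an LLM call via a Pixeltable UDF):

```python
import asyncio

extracted: list[str] = []  # stands in for the warm-memory table

async def extract_facts(conversation: str) -> None:
    """Small-model extraction (Haiku/Llama-3-8B in the real pipeline)."""
    await asyncio.sleep(0)  # simulate the LLM round-trip
    extracted.extend(l for l in conversation.splitlines() if l.startswith("FACT:"))

async def handle_turn(conversation: str) -> str:
    reply = "..."  # Tier 1: compute and return the answer first
    # Tier 2: schedule extraction AFTER the reply; no user-visible latency
    asyncio.create_task(extract_facts(conversation))
    return reply
```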

Deliverables (Extended to 8 Weeks per Council):

  1. Phase 1a: Storage & Hot Memory (Weeks 1-2)

    • Add user_memory table to Pixeltable schema
    • Add conversation_memory table with TTL support
    • Implement user/project scoping
    • Basic CRUD through MCP tools
    • Raw retrieval (Hot Memory tier)
  2. Phase 1b: Async Extraction (Weeks 3-4)

    • Create extract_memory_facts() UDF using small model (Haiku/Llama-3-8B)
    • Async execution: Extract AFTER response sent
    • Confidence scoring and thresholds
    • Extraction determinism: temperature=0
    • Store raw source alongside extracted facts for re-processing
  3. Phase 1c: Retrieval & Ranking (Weeks 5-6)

    • Ranking logic: Hot facts vs Extracted facts
    • Query rewriting for memory search
    • Scope-aware retrieval (user vs project vs global)
  4. Phase 1d: Janitor Process (Weeks 7-8)

    • Nightly consolidation job using reasoning model (GPT-4o)
    • Basic deduplication (>85% similarity threshold)
    • Simple contradiction handling: newer wins, flag for review
    • Memory decay: reduce relevance of unaccessed items
    • Defer complex contradiction resolution to Phase 2
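The "newer wins" consolidation rule can be sketched as follows (a simplification: keys are assumed given, whereas the real janitor uses a reasoning model to decide which memories are about the same thing):

```python
def janitor_pass(memories: list[dict]) -> tuple[list[dict], list[dict]]:
    """Keep the newest memory per key ('newer wins'); superseded items
    are returned separately, flagged for human review."""
    latest: dict[str, dict] = {}
    superseded: list[dict] = []
    for m in sorted(memories, key=lambda mm: mm["created_at"]):
        if m["key"] in latest:
            superseded.append(latest[m["key"]])  # older contradiction, audit trail
        latest[m["key"]] = m
    return list(latest.values()), superseded
```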

Schema Definition:

# Pixeltable tables for conversational memory
user_memory = pt.Table(
    "user_memory",
    columns={
        "user_id": pt.String,
        "content": pt.String,
        "memory_type": pt.String,          # preference, fact, decision
        "confidence": pt.Float,
        "source": pt.String,               # conversation_id, manual
        "raw_source": pt.String,           # Original text for re-extraction
        "extraction_version": pt.Int,      # For re-processing on prompt updates
        "created_at": pt.Timestamp,
        "last_accessed_at": pt.Timestamp,  # For decay scoring
        "expires_at": pt.Timestamp,        # TTL support
    },
    computed_columns={
        "embedding": embed_text(content),
        "scope": memory_scope(user_id, project_id),
    },
)

Validation Metrics (Council Required):

| Metric | Target | Measurement |
|---|---|---|
| Hot memory latency | <50ms p95 | Instrumentation |
| Extraction precision | >85% | Golden Dataset eval |
| Retrieval relevance | >90% rated "helpful" | Blind evaluation |
| Storage efficiency | <10% duplicates post-consolidation | Automated check |
| Query latency | <200ms p95 | Instrumentation |

Golden Dataset (Council Required): Create 50 static questions representing real project queries:

  • "What database do we use?"
  • "Why did we reject MongoDB?"
  • "What's our logging convention?"
  • "Who approved the Kafka decision?"

Run regression tests against Golden Dataset on every PR.
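A per-PR regression gate over the Golden Dataset might look like the following sketch; `answer_query` and the keyword-containment scoring are assumptions for illustration, not the real evaluation harness:

```python
# Minimal sketch of a Golden Dataset regression gate (run on every PR).
# The dataset shape and `answer_query` callable are illustrative.
def evaluate_golden_dataset(dataset: list[dict], answer_query) -> float:
    """Accuracy = fraction of questions whose answer contains the
    expected keyword (simple containment check for the sketch)."""
    hits = 0
    for item in dataset:
        answer = answer_query(item["question"])
        if item["expected_keyword"].lower() in answer.lower():
            hits += 1
    return hits / len(dataset)

def check_exit_criteria(accuracy: float, threshold: float = 0.85) -> bool:
    # ADR-003 exit criterion: >=85% accuracy on the Golden Dataset
    return accuracy >= threshold
```

In CI this would fail the build when accuracy drops below the exit-criteria threshold.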

Exit Criteria:

  • 85% accuracy on Golden Dataset
  • Zero cross-user memory leakage
  • Janitor process completes in <10 minutes for 10k memories

Decision Reversal Trigger: If Week 4 checkpoint shows extraction precision <70%, evaluate:

  1. Scoping down consolidation features
  2. Revisiting Mem0 for extraction-only (not storage)

Phase 2: Context Engineering Optimization

Options: D (Context Engineering) + I (Provenance) + K (Temporal Decay)

Purpose: Optimize retrieval and context assembly

Deliverables:

  1. Memory Blocks Architecture

    [System Block]    │ Core instructions, persona
    [Project Block]   │ Current project context, conventions
    [Task Block]      │ Active task, goals, constraints
    [History Block]   │ Compressed conversation history
    [Knowledge Block] │ Retrieved ADRs, incidents, code
  2. Retrieval Improvements

    • Query rewriting for better matches
    • Reranking layer for relevance
    • Selective injection based on query type
    • Contextual compression for long memories
  3. Provenance Tracking

    • Source links for all memories
    • Confidence scoring
    • Temporal decay implementation
  4. Grounded Memory Ingestion (Council Addition - January 2026)

    Risk: The /memorize command can pollute long-term memory with hallucinations or unsourced claims. "Garbage in, garbage forever."

    Tiered Provenance Model (Council Recommended over NLI):

    ┌─────────────────────────────────────────────────────────────────┐
    │ GROUNDED INGESTION TIERS │
    ├─────────────────────────────────────────────────────────────────┤
    │ TIER 1: Auto-approve (high confidence) │
    │ • Content with explicit ADR/commit/doc links │
    │ • User-stated facts about their own project │
    │ • Decision discussions with clear context │
    ├─────────────────────────────────────────────────────────────────┤
    │ TIER 2: Flag for review (medium confidence) │
    │ • AI-synthesized claims without citations │
    │ • Factual assertions about external systems/APIs │
    │ → Queue for user confirmation before promotion │
    ├─────────────────────────────────────────────────────────────────┤
    │ TIER 3: Block (low confidence) │
    │ • Speculative content ("maybe", "might", "could be") │
    │ • Content that contradicts existing memory │
    │ → Reject with explanation, surface conflicts │
    └─────────────────────────────────────────────────────────────────┘

    Evidence Object Schema:

    class EvidenceObject:
        claim: str                            # The memory content
        source_id: Optional[str]              # ADR-XXX, commit hash, URL
        capture_time: datetime                # When captured
        validity_horizon: Optional[datetime]  # Expiration if time-bound
        confidence: Literal["high", "medium", "low"]

    Validation Checks (lightweight, no external model calls):

    • Citation presence: regex for [ADR-XXX], commit hashes, URLs
    • Hedge word detection: reject speculative language
    • Deduplication: cosine similarity >0.92 rejects as redundant
    • Contradiction detection: deferred to Phase 3 (requires reasoning)
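The lightweight checks above could be sketched roughly as follows; the regex patterns, hedge list, and tier mapping are illustrative, not the shipped validator:

```python
# Hypothetical sketch of the tiered grounded-ingestion checks
# (no external model calls): citation presence and hedge detection.
import re

CITATION_RE = re.compile(r"\[ADR-\d+\]|\b[0-9a-f]{7,40}\b|https?://\S+")
HEDGE_WORDS = {"maybe", "might", "could be", "possibly", "i think"}

def classify_tier(claim: str) -> int:
    """Return 1 (auto-approve), 2 (flag for review), or 3 (block)."""
    lowered = claim.lower()
    if any(h in lowered for h in HEDGE_WORDS):
        return 3          # speculative language -> block (Tier 3)
    if CITATION_RE.search(claim):
        return 1          # explicit ADR/commit/URL citation (Tier 1)
    return 2              # unsourced assertion -> review queue (Tier 2)
```

Deduplication and contradiction detection would run as separate passes, as noted above.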

Exit Criteria:

  • 30% token efficiency improvement
  • Provenance available for all retrieved items
  • Stale memory detection operational
  • Zero hallucination write-back in grounded ingestion tests

Phase 3: Knowledge Structure (HybridRAG)

Options: A (HybridRAG with Knowledge Graph)

Purpose: Multi-hop reasoning for complex code queries

Prerequisites: Stable data from Phase 1-2, validated need for graph queries

Deliverables:

  • Knowledge graph for code dependencies
  • Entity extraction pipeline (async, not blocking)
  • Reciprocal Rank Fusion (vector + graph)
  • Neural reranker for final results

Hybrid Search Architecture (Council Addition - January 2026)

Intent: Combine vector similarity with graph relationships and keyword matching. Specific parameters to be tuned based on Phase 2 learnings.

┌─────────────────────────────────────────────────────────────────────┐
│ TWO-STAGE RETRIEVAL ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────┤
│ STAGE 1: Candidate Generation (parallel) │
│ • Vector Similarity (Dense) │
│ • Keyword BM25 (Sparse) │
│ • Knowledge Graph Traversal (Relationship-based) │
├─────────────────────────────────────────────────────────────────────┤
│ STAGE 2: Fusion + Reranking │
│ • Reciprocal Rank Fusion (RRF): Merge ranked lists │
│ • Cross-Encoder Reranker: Score top candidates on relevance │
│ • Return top 5 to context window │
└─────────────────────────────────────────────────────────────────────┘

Architectural Decisions (to be validated in Phase 2):

| Component | Intent | Parameters TBD |
|---|---|---|
| RRF Formula | Σ 1/(k + rank_i) | k value (standard: 60) |
| Multi-Query Expansion | Generate semantic variants | Max 3 variants (cost cap) |
| Candidate Pool | Broad retrieval before filtering | Top 50-100 candidates |
| Reranker | Cross-encoder for final ordering | Model choice TBD |
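The RRF formula above (score = Σ 1/(k + rank_i) across the ranked lists a document appears in, with k = 60 by convention) reduces to a few lines:

```python
# Reciprocal Rank Fusion: merge several ranked lists of document IDs.
# Each list contributes 1/(k + rank) for every document it contains.
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only uses ranks, it sidesteps the "different score scales" problem when merging BM25, vector, and graph candidates.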

Phase 2 Learnings Required Before Implementation:

  • What percentage of queries require graph traversal?
  • What is the keyword-search hit rate compared with vector search?
  • Where are retrieval failures occurring? (measure first, optimize second)

Target Queries (justify the investment):

"What services depend on auth that had incidents last month?"
"Show me all code that changed because of ADR-005"
"Which team owns the service that calls the failing endpoint?"

Exit Criteria:

  • Multi-hop queries outperform pure vector by >50%
  • Latency <1s for graph-augmented queries
  • Entity extraction runs async (no chat blocking)
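A multi-hop dependency query of the kind targeted above ("What services depend on auth?") can be sketched over a plain adjacency map; the real store uses a NetworkX DiGraph, so this is only a stdlib illustration:

```python
# Sketch of multi-hop reverse traversal over DEPENDS_ON edges.
# `graph` maps service -> the set of things it depends on.
from collections import deque

def dependents(graph: dict[str, set[str]], target: str,
               max_hops: int = 3) -> set[str]:
    """Every node with a dependency path to `target` within max_hops."""
    reverse: dict[str, set[str]] = {}
    for src, deps in graph.items():
        for dep in deps:
            reverse.setdefault(dep, set()).add(src)
    found: set[str] = set()
    frontier = deque([(target, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for parent in reverse.get(node, ()):
            if parent not in found:
                found.add(parent)
                frontier.append((parent, hops + 1))
    return found
```

This is the traversal step of Stage 1 candidate generation; results would then flow into RRF fusion alongside vector and keyword candidates.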

Phase 4: Advanced Capabilities

Options: C (Hindsight), E (MaaS), J (RAT)

Purpose: Future-state capabilities based on validated needs

Conditional Entry:

  • Hindsight (C): Only if Phase 3 reveals need for temporal/causal reasoning
  • MaaS (E): Only if multi-agent architecture is adopted
  • RAT (J): Only if complex reasoning queries dominate usage

Approach: Cherry-pick techniques from Hindsight rather than wholesale adoption

Phase 4.2: MaaS Trust Model (Council-Documented Decision)

Context: The LLM Council raised design-level concerns during security verification:

  • Callers can specify arbitrary capabilities
  • Agents can join pools by knowing the pool ID
  • Arbitrary owner_id assignment without verification

Decision: These are intentional architectural decisions, not security gaps.

Trust Boundary: MaaS APIs are internal interfaces, not public-facing endpoints.

┌─────────────────────────────────────────────────────────────────────────────┐
│ TRUST BOUNDARY MODEL │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ EXTERNAL (Untrusted) │ INTERNAL (Trusted) │
│ ───────────────────── │ ──────────────────── │
│ • End users │ • MCP Server layer │
│ • External APIs │ • CLI orchestrator │
│ • Network requests │ • Extension Registry │
│ │ • MaaS Registries │
│ │ │ │ │
│ ▼ │ ▼ │
│ ┌───────────────┐ │ ┌───────────────┐ │
│ │ Auth Layer │─────────────┼─▶│ Orchestrator │ │
│ │ (MCP Server) │ │ │ (Trusted) │ │
│ └───────────────┘ │ └───────────────┘ │
│ │ │ │
│ Authentication happens HERE │ ▼ │
│ │ ┌───────────────┐ │
│ │ │ MaaS APIs │ │
│ │ │ (No auth) │ │
│ │ └───────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

What MaaS APIs Assume (Trusted Orchestrator Pattern):

| Assumption | Rationale |
|---|---|
| owner_id is verified | Orchestrator authenticated the user before calling MaaS |
| Capabilities are appropriate | Orchestrator determines agent permissions based on auth context |
| Pool IDs are authorized | Orchestrator controls which pool IDs are exposed to which agents |
| Agent IDs are valid | Orchestrator manages agent lifecycle and provides valid IDs |

What MaaS APIs Enforce (Defense in Depth):

| Enforcement | Implementation |
|---|---|
| Capability checks | AgentIdentity.has_capability() - agents can only perform allowed actions |
| Scope hierarchy | SharedScope enum - agents cannot read above their max scope |
| Capacity limits | RegistryCapacityError, PoolCapacityError, HandoffCapacityError |
| Audit logging | All operations logged via MaaSAuditLogger for forensics |
| ID entropy | 128-bit UUIDs prevent ID guessing attacks |
| Defensive copies | All inputs/outputs copied to prevent state mutation |
| Handoff lifecycle | State machine prevents invalid transitions |
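The capability and scope checks can be sketched as follows; the field and method names mirror the ADR's description (`AgentIdentity.has_capability()`, `SharedScope`), but the bodies are illustrative, not the actual source:

```python
# Hedged sketch of MaaS defense-in-depth checks. An ordered enum lets
# "cannot read above max scope" become a simple comparison.
from dataclasses import dataclass, field
from enum import IntEnum

class SharedScope(IntEnum):
    PRIVATE = 0
    POOL = 1
    GLOBAL = 2

@dataclass
class AgentIdentity:
    agent_id: str
    capabilities: frozenset[str] = field(default_factory=frozenset)
    max_scope: SharedScope = SharedScope.PRIVATE

    def has_capability(self, cap: str) -> bool:
        # Agents can only perform actions they were granted
        return cap in self.capabilities

    def can_read(self, scope: SharedScope) -> bool:
        # Agents cannot read above their maximum scope
        return scope <= self.max_scope
```

Note that nothing here authenticates the caller: per the trust model, that happened earlier, in the orchestrator layer.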

What MaaS APIs Do NOT Enforce:

| Not Enforced | Why |
|---|---|
| User authentication | Handled by MCP server / CLI layer |
| owner_id verification | Orchestrator responsibility (has auth context) |
| Pool membership authorization | Orchestrator controls pool ID exposure |
| Capability assignment policy | Business logic varies by deployment |

Rationale: This follows the same pattern as other internal registries:

  • ExtensionRegistry (src/extensions/registry.py) - no auth checks
  • LocalMemoryProvider (src/memory/providers/local.py) - trusts user_id
  • MemoryStore - trusts caller-provided identifiers

Security Model Summary:

Authentication:     MCP Server / CLI Layer (EXTERNAL)
Authorization:      Orchestrator Layer (decides what to call)
Capability Checks:  MaaS API Layer (enforces what agents CAN do)
Audit Trail:        MaaS Audit Logger (records what DID happen)

Council Verification: This trust model was documented following LLM Council feedback during v0.3.0 security hardening. The council's concerns were valid observations about the API surface, but represent intentional design decisions for internal interfaces rather than security vulnerabilities.


Roadmap Visualization

┌─────────────────────────────────────────────────────────────────────────────┐
│ ROADMAP TIMELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Phase 0 Phase 1 Phase 2 Phase 3 Phase 4 │
│ Foundations Mem0 Context Eng HybridRAG Advanced │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ Eval │ │ Memory │ │ Blocks │ │ Graph │ │Hindsight│ │
│ │ Schema │─────▶│ Policy │─────▶│ Rerank │─────▶│ Vector │───▶│ MaaS │ │
│ │ Govern │ │ Cache │ │ Decay │ │ Fusion │ │ RAT │ │
│ └────────┘ └────────┘ └────────┘ └────────┘ └────────┘ │
│ │
│ [Must Have] [Quick Win] [Optimization] [Conditional] [Future] │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Tier Constraints (ADR-004): Memory persistence depth and retention policies vary by monetization tier. Free tier has local-only storage; Team tier includes cloud persistence with GDPR auto-deletion; Enterprise tier supports configurable retention with legal hold. See ADR-004 for tier definitions and ADR-007 for implementation details.


Implementation Tracker

| Phase | Component | Status | Notes |
|---|---|---|---|
| Phase 0 | Evaluation Harness | ✅ Complete | src/memory/evaluation/ - 28 tests |
| Phase 0 | Golden Dataset | ✅ Complete | tests/memory/golden_dataset.json - 50 questions |
| Phase 0 | Memory Schema & Lifecycle | ✅ Complete | src/memory/schemas/, src/memory/lifecycle/ - 36 tests |
| Phase 0 | Governance & Observability | ✅ Complete | src/memory/observability/ - 27 tests |
| Phase 0 | MemoryProvider Protocol | ✅ Complete | src/memory/protocols.py - ADR-007 compliant |
| Phase 0 | HNSW Recall Health Monitoring | ✅ Complete | src/memory/evaluation/recall_health.py, brute_force.py, baseline.py - 37 tests |
| Phase 0 | Embedding Version Tracking | ✅ Complete | src/memory/evaluation/embedding_version.py - Council verified |
| Phase 0 | Reindex Trigger | ✅ Complete | src/memory/maintenance/reindex_trigger.py - Auto-reindex on threshold breach |
| Phase 1 | Session Memory MCP | ✅ Complete | session_memory_server.py - 45 tests |
| Phase 1 | Pixeltable MCP | ✅ Complete | pixeltable_mcp_server.py - 62 tests |
| Phase 1a | Hot Memory Storage | ✅ Complete | src/memory/storage/, src/memory/providers/local.py - 27 tests |
| Phase 1a | Memory MCP Tools | ✅ Complete | src/memory/mcp/tools.py - 21 tests |
| Phase 1a | Hot Memory Latency | ✅ Complete | <50ms p95 target met - 6 tests |
| Phase 1b | Async Extraction | ✅ Complete | src/memory/extraction/ - 36 tests |
| Phase 1b | Extraction Precision | ✅ Complete | >85% target met (100% achieved) - 5 tests |
| Phase 1c | Retrieval & Ranking | ✅ Complete | src/memory/retrieval/ - 31 tests |
| Phase 1c | Query Rewriting | ✅ Complete | Synonym expansion for better recall |
| Phase 1c | Scope-Aware Retrieval | ✅ Complete | user > project > global hierarchy |
| Phase 1c | Retrieval Latency | ✅ Complete | <200ms p95 target met - 6 tests |
| Phase 1d | Janitor Process | ✅ Complete | src/memory/janitor/ - 24 tests |
| Phase 1d | Deduplication | ✅ Complete | >85% similarity threshold |
| Phase 1d | Contradiction Handling | ✅ Complete | "newer wins" with flagging |
| Phase 1d | Expiration Cleanup | ✅ Complete | TTL-based expiration |
| Phase 1d | Janitor Performance | ✅ Complete | <10 min for 10k target met - 5 tests |
| CI/Security | Memory Isolation | ✅ Complete | Zero cross-user leakage - 8 tests |
| CI/Security | Protocol Compliance | ✅ Complete | Three-layer testing - 23 tests |
| CI/Security | CI Workflow | ✅ Complete | .github/workflows/memory-evaluation.yml |
| Phase 2 | Memory Blocks Architecture | ✅ Complete | 5-block layout with XML delimiters |
| Phase 2 | Provenance Tracking | ✅ Complete | Full provenance on all retrievals |
| Phase 2 | Temporal Decay | ✅ Complete | Integrated in retrieval ranking |
| Phase 2 | History Compression | ✅ Complete | Line-preserving truncation |
| Phase 2 | Token Efficiency | ✅ Complete | 40% efficiency (>30% target) |
| Phase 2 | Exit Criteria Tests | ✅ Complete | 6 benchmark tests passing |
| Phase 2 | Grounded Memory Ingestion | ✅ Complete | 3-tier provenance model: src/memory/ingestion/ - 62 tests (security hardened) |
| Phase 3 | Two-Stage Retrieval | ✅ Complete | src/memory/retrieval/ - BM25 + Vector + RRF + Cross-Encoder |
| Phase 3 | Provider Integration | ✅ Complete | LocalMemoryProvider with optional hybrid retrieval |
| Phase 3 | Exit Criteria | ✅ Complete | <1s latency, multi-hop improvement - 15 benchmarks |
| Phase 3 | Entity Extraction | ✅ Complete | src/memory/extraction/entities/ - 81 tests, async pipeline, user_id auth |
| Phase 4 | Knowledge Graph | ✅ Complete | src/memory/graph/ - NetworkX backend, 96 tests |
| Phase 4 | Graph Types & Store | ✅ Complete | RelationshipType enum, GraphNode, GraphEdge, KnowledgeGraph |
| Phase 4 | Graph Builder | ✅ Complete | Builds graph from Memory entities with relationship inference |
| Phase 4 | Graph Search | ✅ Complete | Stage 1 candidate generation with traversal |
| Phase 4 | HybridRetriever Integration | ✅ Complete | 3-source RRF fusion (BM25 + Vector + Graph) |
| Phase 4 | LocalMemoryProvider Graph | ✅ Complete | Optional graph index management |
| Phase 4 | Hindsight Integration | ✅ Complete | src/memory/hindsight/ - 75 tests, four-network architecture |
| Phase 4 | Hindsight Types | ✅ Complete | NetworkType, TimeRange, TemporalEvent, TemporalMemory, StateChange |
| Phase 4 | Hindsight Timeline | ✅ Complete | Event storage, state reconstruction, time queries |
| Phase 4 | Hindsight Temporal Search | ✅ Complete | NL temporal query parsing, relevance scoring |
| Phase 4.2 | MaaS Architecture | ✅ Complete | src/memory/maas/ - 149 tests, Trust Model documented |
| Phase 4.2 | Agent Registry | ✅ Complete | registry.py - agent identity, capabilities, sessions |
| Phase 4.2 | Pool Registry | ✅ Complete | pool.py - shared memory pools, membership, permissions |
| Phase 4.2 | Handoff Manager | ✅ Complete | handoff.py - task handoffs, lifecycle management |
| Phase 4.2 | MCP Tools | ✅ Complete | mcp_tools.py - 15 async tools for agent/pool/handoff |
| Phase 4.2 | Security Hardening | ✅ Complete | 128-bit IDs, capacity limits, audit logging, defensive copies |
| Phase 4.2 | Trust Model | ✅ Complete | Council-documented architectural decisions (see Phase 4.2 section) |
| Option G | Context Caching | ✅ Complete | src/memory/retrieval/cache.py - 37 tests, LRU+TTL |
| Option G | RetrievalCache | ✅ Complete | LRU eviction, configurable TTL, thread-safe, metrics |
| Option G | Provider Integration | ✅ Complete | LocalMemoryProvider use_cache, auto-invalidation |
| Phase D | Production Optimization | ✅ Complete | src/memory/observability/ - monitoring and ops |
| Phase D | HNSW Scale Monitoring | ✅ Complete | scale_milestones.py - 10k/50k/100k thresholds - 36 tests |
| Phase D | RRF Weight Tuning | ✅ Complete | fusion.py weights parameter - 6 tests |
| Phase D | Graph Query Monitoring | ✅ Complete | graph_metrics.py - hop latency, size tracking - 27 tests |
| Phase D | Operations Runbook | ✅ Complete | docs/operations/memory-runbook.md |

Test Summary: 1282 memory tests passing (as of 2026-01-16) - 149 new MaaS tests

Legend: ✅ Complete | 🔄 Partial | 📝 Not Started | ❌ Blocked


Consequences

Positive

  • AI assistants maintain context across sessions
  • Reduced repetitive context loading (estimated 60-80% reduction)
  • Institutional knowledge becomes searchable
  • Consistent AI suggestions aligned with past decisions
  • Multi-project organizational awareness
  • Foundation for advanced multi-agent workflows

Negative

  • Requires initial setup and ingestion
  • Python version binding (mitigated by ADR-001)
  • Storage growth with codebase size
  • Embedding computation cost (mitigated by local models)
  • New infrastructure to maintain (Pixeltable, MCP servers)
  • Team learning curve on memory-augmented AI patterns

Neutral

  • Requires MCP-compatible client
  • Team adoption needed for full benefit
  • ADR/incident discipline improves value
  • Commits us to MCP protocol (reasonable bet given industry adoption)
  • May require revisiting if context windows grow significantly larger

Risks and Mitigations (Council Additions)

Technical Risks

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Memory Poisoning | Medium | High | Provenance tracking, confidence scoring, human approval gates for "decision-grade" memories |
| Staleness & Drift | High | High | Link memories to repo commits, TTL policies, periodic revalidation jobs, git hook triggers |
| Contradictions | Medium | Medium | Conflict resolution policy (prefer newer? prefer docs?), show both with citations |
| Latency Blowups | Medium | High | Budgets per tier, caching, "fast path" vs "deep reasoning path" separation |
| Relevance Pollution | High | Medium | Temporal decay (Option K), ranking algorithms, "forgetting curve" implementation |
| Schema Drift | High | Medium | Async re-indexing on commit, version migrations, schema evolution plan |
| HNSW Silent Recall Degradation (Jan 2026) | High | High | Recall@k monitoring against exact search baseline; automated reindex triggers; retuning milestones at 10k/50k/100k items |
| Filter-Induced Recall Collapse (Jan 2026) | Medium | High | Evaluate recall with metadata filters applied (not just unfiltered); test heavy filtering scenarios (tenant, date, doc_type) |
| Embedding Model Drift (Jan 2026) | Medium | High | Version-tag all embeddings; on model change, flag for re-embedding; strict index-to-model versioning |
| Retrieval Poisoning (Jan 2026) | Low | High | Treat retrieved context as untrusted input; sanitization layer before injecting into system prompt; filter malicious instruction patterns |
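The Recall@k monitoring named in the mitigations compares the ANN index against an exact brute-force baseline; a minimal sketch follows (the 0.95 alert threshold is an assumption, not a documented value):

```python
# Sketch of Recall@k health monitoring for an HNSW index: what fraction
# of the true top-k (from exact search) did the ANN index return?
def recall_at_k(ann_ids: list[str], exact_ids: list[str], k: int) -> float:
    truth = set(exact_ids[:k])
    return len(truth & set(ann_ids[:k])) / k

def needs_reindex(recall: float, threshold: float = 0.95) -> bool:
    # Trigger an automated reindex when recall silently degrades
    return recall < threshold
```

Run against a sampled query set on a schedule, this catches the "silent recall degradation" failure mode before users notice.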

Organizational Risks

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Scope Creep via Options | High | High | Hard constraint: Pick ≤3 options for 2025, phase gates required |
| Benchmark Gaming | Low | Medium | Include qualitative user feedback, test against real project history |
| Vendor Lock-in | Medium | Medium | Define internal memory APIs and adapters, avoid proprietary-only features |
| Operational Complexity | Medium | Medium | Document backup/migration procedures, plan embedding model upgrades |

Security & Privacy Risks

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Memory Leakage | Medium | High | Strict namespace isolation, MCP scoping, audit logs for cross-project retrievals |
| MEXTRA Attacks | Low | High | Output filtering, memory de-identification, input validation |
| Secret Exposure | Low | Critical | Secret redaction in ingestion, .gitignore-style exclusion patterns |
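Secret redaction in ingestion can be sketched as a small pattern pass; the patterns below are common examples for illustration, not an exhaustive production set:

```python
# Illustrative secret-redaction pass for the ingestion pipeline.
import re

SECRET_PATTERNS = [
    # key=value style credentials
    re.compile(r"(?i)(api[_-]?key|token|password)\s*[:=]\s*\S+"),
    # AWS access key ID shape
    re.compile(r"AKIA[0-9A-Z]{16}"),
    # PEM private key blocks
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----[\s\S]*?"
               r"-----END [A-Z ]*PRIVATE KEY-----"),
]

def redact_secrets(text: str) -> str:
    """Replace anything matching a secret pattern before storage."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```

Running this before embedding and storage means a leaked credential never becomes a persistent, searchable memory.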

Risk Assessment by Council Model

Gemini: "Memory update triggers (Git hooks) are required to keep memory synchronized. Without them, the Knowledge Graph becomes a hallucination source."

Claude: "Wrong context is worse than no context. Memory poisoning needs confidence scoring and human approval gates."

Grok: "Ensure memory systems don't perpetuate biases in persistent context via biased knowledge graphs."

GPT: "Memory systems need pruning strategies - relevance pollution is a real problem at scale."

Security Considerations

Per recent research (MEXTRA attack, February 2025), memory systems are vulnerable to:

  • Prompt injection extracting stored memories
  • Cross-session data leakage

Mitigations (to implement):

  • User/session-level isolation
  • Memory de-identification
  • Output filtering
  • Audit logging

References

Open Questions

The following questions have been investigated; each is resolved or partially resolved:

  1. Single-tenant vs Multi-tenant: ✅ Resolved - Multi-tenancy is supported via the TenantProvider protocol (ADR-005) and CloudTenantProvider implementation (ADR-007). See ADR-007 Section 1a for tenant isolation enforcement points.

  2. Memory Approval Workflows: ✅ Resolved - The AuditLogger protocol (ADR-007) provides provenance tracking. GDPR compliance workflows in GDPRService (luminescent-cloud) implement approval gates for data deletion.

  3. Retention Policies: ✅ Resolved - ADR-007 Section 4 defines tier-specific retention:

    • Free (OSS): User's responsibility (no auto-deletion)
    • Team (Cloud): Auto-delete on workspace exit (GDPR Article 17)
    • Enterprise: Configurable (legal hold support)
  4. Embedding Model Upgrades: 🔄 Partially Resolved - Pixeltable computed columns support re-indexing. Full migration strategy deferred to Phase 3 (HybridRAG).

  5. Conflict Resolution: 🔄 Partially Resolved - Phase 1d Janitor Process implements "newer wins" with flagging for review. Complex contradiction handling deferred to Phase 2 (Context Engineering).

Council Review Summary

Design Review (2025-12-22)

Council Configuration: High confidence (all 4 models) Models: Gemini-3-Pro, Claude Opus 4.5, Grok-4, GPT-5.2-Pro

Unanimous Recommendations (All 4 Models)

  1. Add Phase 0 for foundations, evaluation, and governance
  2. Reorder roadmap: Mem0 before HybridRAG
  3. Add quantified success metrics
  4. Add provenance/governance as core requirement
  5. Include context caching for cost optimization

Key Insights by Model

  • Gemini: "Optimization before Foundation" is the core flaw
  • Claude: "Wrong context is worse than no context"
  • Grok: Emphasized bias prevention in knowledge graphs
  • GPT: Highlighted need for "Phase 0" evaluation harness

Implementation Verification (2026-01-07)

Council Configuration: High confidence (all 4 models) Models: GPT-5.2-Pro, Gemini-3-Pro, Claude Opus 4.5, Grok-4 Transcripts: .council/logs/2026-01-07T21-25-55-87316c79/, 2026-01-07T21-41-43-284ed67c/, 2026-01-07T21-49-29-4db02c46/

Round 1 (Commit 0c8c41a) - REJECTED

| Issue | Severity | Fix Applied |
|---|---|---|
| EvaluationHarness defaulted success=True when no evaluate_fn | Critical | Changed to success=False |
| Precision/recall/F1 aliased to accuracy (not calculated) | Critical | Now uses metrics.py functions |
| O(N²) janitor complexity | High | Documented as known trade-off (meets <10min target) |
| Silent exception swallowing | High | Added error collection and reporting |

Round 2 (Commit e448f22) - REJECTED

| Issue | Severity | Fix Applied |
|---|---|---|
| Janitor fallback modified metadata but didn't persist | Critical | Removed broken fallback, use invalidate/delete only |
| FP/FN indistinguishable (both set to failed) | Critical | Now tracks by retrieved_memories presence |
| invalidated count incremented without action | Critical | Only increment on successful operation |

Round 3 (Commit 720a61b) - UNCLEAR (Confidence 0.54)

| Remaining Concern | Status | Rationale |
|---|---|---|
| O(N²) complexity | Acceptable | Meets ADR-003 target (<10min for 10k memories, actual: <3s) |
| resolve_duplicates unused | By Design | Used internally by find_duplicates workflow |
| Keyword-based contradiction detection | Phase 1 Scope | Semantic analysis deferred to Phase 2 |
| Hard-coded limit=10000 | Acceptable | Documented limitation for Phase 1 |

Round 4 (Commit 0efa8a2) - UNCLEAR (Confidence 0.61, No Blocking Issues)

Focus: ADR-005 Dual-Repo Compliance Fix

| Issue | Severity | Fix Applied |
|---|---|---|
| MemoryProvider not exported from extensions module | Critical | Added to __init__.py imports and __all__ |
| ResponseFilter not exported | Critical | Added to __init__.py imports and __all__ |
| MEMORY_PROVIDER_VERSION not exported | Critical | Added to __init__.py imports and __all__ |
| Misleading docstring import path | Medium | Fixed to use src.extensions |
| Brittle _is_runtime_protocol check | Medium | Changed to isinstance() test |

Outcome: No blocking issues. Rationale "approved" at 0.9 confidence. Transcript: .council/logs/2026-01-07T22-43-42-b3f93afd/

Final Scores (Round 4)

  • Accuracy: 10.0/10
  • Completeness: 6.3/10
  • Clarity: 7.0/10
  • Blocking Issues: None

Split Decision (Rounds 1-3)

  • GPT-5.2-Pro: REJECTED (strictest interpretation)
  • Gemini-3-Pro: REJECTED (completeness concerns)
  • Claude Opus 4.5: NEEDS REVIEW (acceptable for Phase 1)
  • Grok-4: APPROVED (most lenient)

Related Decisions

  • ADR-001: Python Version Requirement (database integrity)
  • ADR-002: Workflow Integration (automated ingestion)
  • ADR-004: Monetization Strategy (tier-based memory constraints)
  • ADR-005: Repository Organization Strategy (OSS vs Paid separation, Extension Registry)
  • ADR-006: Chatbot Platform Integrations (adapters implementing memory I/O)
  • ADR-007: Cross-ADR Integration Guide (Phase Alignment Matrix, protocol consolidation)

Changelog

VersionDateChanges
1.02025-12-22Initial draft
2.02025-12-22Added industry research and 6 improvement options
3.02025-12-22Council Review: Revised roadmap (Phase 0 + reordering), added decision drivers, non-goals, success metrics table, interface contract, 5 additional options (G-K), comprehensive risk matrix, council feedback summary
4.02025-12-22Council Review #2: Rejected Mem0 integration in favor of Pixeltable-native memory. Added "Build vs Integrate" decision section with code examples. Updated Phase 1 to Pixeltable Native approach. Added reference to ADR-004 (Monetization).
4.12025-12-23Council Validation: Extended Phase 1 to 8 weeks. Added "Janitor" (Hot/Warm/Cold) architecture for async extraction. Added Golden Dataset requirement. Added validation metrics and decision reversal triggers.
4.22025-12-28Cross-ADR Synchronization: Added ADR-006 and ADR-007 to Related Decisions. Updated Non-Goals to reflect multi-tenancy via Extension Registry. Added Interface Contract admonition linking to consolidated protocols. Marked Open Questions as resolved with references. Added Implementation Tracker section. Added tier constraints note to Roadmap. Council review (Grok-4, GPT-4o).
4.32026-01-07Implementation Verification: Council reviewed Phase 0-1d implementation across 3 rounds. Fixed critical bugs in EvaluationHarness (metrics calculation) and Janitor (persistence, error handling). Added dry-run mode and soft-delete to janitor. Updated test count to 330. Added Implementation Verification section with Council transcripts. Phase 2+ remains NOT STARTED.
4.42026-01-07ADR-005 Compliance Fix: Exported MemoryProvider, ResponseFilter, MEMORY_PROVIDER_VERSION from src/extensions/__init__.py. Council Round 4: No blocking issues, 10.0/10 accuracy. Fixed #117, unblocked #114. Test count: 1008 total (336 memory-specific).
4.52026-01-08Phase 2 Complete: Implemented Memory Blocks Architecture with 5-block layout (System, Project, Task, History, Knowledge). Added provenance tracking on all retrievals, line-preserving truncation, XML-safe delimiters. Met all exit criteria: 40% token efficiency (>30% target), provenance on all items, stale detection operational. Test count: 436 memory tests. Council verified across 8 rounds.
4.62026-01-08Security Hardening (Council Rounds 13-19): Comprehensive DoS prevention in ProvenanceService. Added: bounded LRU storage, string identifier length limits, metadata bounds validation, recursive nested structure validation with early termination, strict JSON type safety, cycle detection, UTF-8 byte size validation, TOCTOU prevention via deep copy, score range validation (0.0-1.0). Test count: 490 memory tests (64 provenance-specific security tests).
4.72026-01-09Research-Driven Strategy Update: Council-validated additions based on January 2026 RAG research. Phase 0: Added HNSW Recall Health Monitoring (critical silent failure mode), recall@k against exact search baseline, retuning milestones at 10k/50k/100k items, embedding versioning. Phase 2: Added Grounded Memory Ingestion with 3-tier provenance model, Evidence Object schema. Phase 3: Added Two-Stage Retrieval Architecture intent (RRF, multi-query expansion, cross-encoder reranking). Risks: Added HNSW silent recall degradation, filter-induced recall collapse, embedding model drift, retrieval poisoning. References: HNSW at Scale (TDS), 12 RAG Types (TuringPost).
4.82026-01-09HNSW Recall Health Complete: Implemented full Phase 0 HNSW monitoring suite. Added recall_health.py (RecallHealthMonitor, Recall@k computation), brute_force.py (exact cosine similarity ground truth), baseline.py (drift detection with atomic writes, history pruning), embedding_version.py (model version tracking). Security hardening: symlink protection, path containment, PII exclusion, salted hashing. Added ReindexTrigger for auto-reindex. Council verified across 31 rounds (0 blocking issues, 10/10 accuracy). Test count: 527 memory tests. Remaining: Grounded Memory Ingestion (Phase 2), Two-Stage Retrieval (Phase 3).
4.92026-01-09Grounded Memory Ingestion Complete: Implemented Phase 2 3-tier provenance model to prevent hallucination write-back. Added src/memory/ingestion/ package with 8 modules: evidence.py (EvidenceObject), result.py (ValidationResult, IngestionTier), citation_detector.py (ADR/commit/URL regex), hedge_detector.py (speculative language blocking), dedup_checker.py (Jaccard similarity >0.92), validator.py (IngestionValidator orchestrator), review_queue.py (Tier 2 pending memories). Integrated into create_memory() MCP tool with bypass_validation option. Added review queue MCP tools: get_pending_memories, approve_pending_memory, reject_pending_memory. Exit criteria met: zero hallucination write-back in grounded ingestion tests. Test count: 575 memory tests (48 ingestion-specific). Remaining: Two-Stage Retrieval (Phase 3).
5.02026-01-09Grounded Memory Ingestion Security Hardened: Council verification across 6 rounds identified and fixed critical security vulnerabilities. Fixes: (1) Hedge bypass via assertion markers - assertions no longer override is_speculative; (2) Dedup fail-open - now raises DedupCheckError, flags for review; (3) Cross-tenant data leak in get_review_history - requires user_id; (4) Cross-tenant DoS via eviction - rejects at capacity; (5) IDOR in get_by_id - requires authorization; (6) Strong speculation missed - added "i don't know"; (7) Race condition in approve - removes before callback; (8) Unbounded history growth - added MAX_HISTORY_ENTRIES. Council synthesis: "approved" at 0.9 confidence. Test count: 589 memory tests (62 ingestion-specific).
6.02026-01-09Two-Stage Retrieval Architecture Complete (Phase 3): Implemented hybrid retrieval as specified in ADR-003 Phase 3. Stage 1 (Parallel Candidate Generation): bm25.py (BM25 sparse keyword search with tokenization, IDF, multi-tenant indexes), vector_search.py (dense semantic search with sentence-transformers, lazy loading, cosine similarity). Stage 2 (Fusion + Reranking): fusion.py (RRF algorithm with k=60, weighted fusion, fuse_with_details for provenance), reranker.py (cross-encoder reranking with ms-marco-MiniLM-L-6-v2, fallback reranker for fast mode). Orchestrator: hybrid.py (HybridRetriever with parallel Stage 1 via asyncio.gather, RRF fusion, optional reranking, RetrievalMetrics for latency tracking). Integration: Updated src/memory/retrieval/__init__.py exports. Exit Criteria Met: latency <1s, multi-hop queries outperform pure vector. Test count: 757 memory tests (168 new retrieval tests, 15 exit benchmarks). Remaining: Knowledge Graph (Phase 3), Entity Extraction (Phase 3).
6.12026-01-09Provider Integration Complete: Integrated HybridRetriever into LocalMemoryProvider with backwards compatibility. Features: Optional use_hybrid_retrieval parameter (default: False), use_cross_encoder for quality vs speed tradeoff, use_query_rewriter for term expansion. New Methods: retrieve_with_scores() returns (Memory, score) tuples, retrieve_with_metrics() returns detailed RetrievalMetrics. Automatic Index Management: store/delete/clear operations sync with hybrid indexes. Bug Fix: Fixed hybrid.py to use query_rewriter.rewrite() instead of expand() (string vs list return type). Test count: 773 memory tests (16 new provider integration tests). Remaining: Knowledge Graph (Phase 3), Entity Extraction (Phase 3).
- **6.2** (2026-01-16): **Entity Extraction Complete (Phase 3).** Implemented an async entity extraction pipeline for knowledge graph support. New module: `src/memory/extraction/entities/` with 6 files. Types: `EntityType` enum (SERVICE, DEPENDENCY, API, PATTERN, FRAMEWORK, CONFIG), `Entity` dataclass with confidence scoring, `EntityExtractor` protocol. Extractors: `MockEntityExtractor` (pattern-based, for testing), `HaikuEntityExtractor` (LLM-based, for production). Pipeline: `EntityExtractionPipeline` with `process()` (blocking) and `process_async()` (fire-and-forget). Security: a `user_id` authorization check prevents cross-user memory modification. Storage: entities are stored in `Memory.metadata["entities"]` for knowledge graph construction. Provider fix: `LocalMemoryProvider` now supports metadata merge updates. GitHub issues: closed #118, #119, #120, #121. Test count: 854 memory tests (81 entity extraction tests, incl. 2 security). TDD approach plus council security review. Remaining: Knowledge Graph (Phase 4).
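A pattern-based extractor like the `MockEntityExtractor` above can be sketched with a small rule table; the specific regex patterns, the 0.8 confidence, and the two-member enum here are illustrative assumptions, not the module's real rules:

```python
# Sketch of a pattern-based entity extractor. The real module defines
# six EntityType members and richer rules; this shows only the shape.
import re
from dataclasses import dataclass
from enum import Enum

class EntityType(Enum):
    SERVICE = "service"
    DEPENDENCY = "dependency"

@dataclass(frozen=True)
class Entity:
    name: str
    type: EntityType
    confidence: float

# Illustrative rules: "*-service" names and a few known dependencies.
PATTERNS = [
    (re.compile(r"\b([a-z][a-z0-9-]*-service)\b"), EntityType.SERVICE),
    (re.compile(r"\b(PostgreSQL|Redis|Kafka)\b"), EntityType.DEPENDENCY),
]

def extract_entities(text: str) -> list[Entity]:
    entities = []
    for pattern, etype in PATTERNS:
        for match in pattern.finditer(text):
            entities.append(Entity(match.group(1), etype, confidence=0.8))
    return entities

found = extract_entities("auth-service depends on PostgreSQL")
```

Storing the result under `Memory.metadata["entities"]` then gives the graph builder a uniform input regardless of whether the mock or the LLM extractor produced it.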
- **6.3** (2026-01-16): **Knowledge Graph Complete (Phase 4).** Implemented multi-hop reasoning for queries like "What services depend on PostgreSQL?". New module: `src/memory/graph/` with 5 files. Types: `RelationshipType` enum (7 types: DEPENDS_ON, USES, IMPLEMENTS, CALLS, CONFIGURES, HAD_INCIDENT, OWNED_BY), `GraphNode` dataclass, `GraphEdge` dataclass. Store: `KnowledgeGraph` using a NetworkX `DiGraph`, with serialization support. Builder: `GraphBuilder` creates the graph from `Memory` entities with relationship inference. Search: `GraphSearch` for Stage 1 candidate generation with traversal scoring (direct match 1.0, neighbor 0.7, predecessor 0.6). `HybridRetriever`: updated for 3-source RRF fusion (BM25 + vector + graph); metrics track `graph_candidates`. `LocalMemoryProvider`: optional `use_graph=True` parameter, automatic graph building from stored memories. Exit criteria met: <1s latency; multi-hop queries outperform pure vector. GitHub issues: closed #122, #123, #124, #125, #126. Test count: 955 memory tests (96 new Knowledge Graph tests). TDD approach with 11 integration tests.
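The traversal scoring above (direct match 1.0, neighbor 0.7, predecessor 0.6) can be sketched as a one-hop candidate generator. The real store uses a NetworkX `DiGraph`; a plain adjacency dict is used here to keep the sketch dependency-free, and the function name is an assumption:

```python
# One-hop graph candidate scoring sketch: the query node scores 1.0,
# its successors (outgoing neighbors) 0.7, its predecessors 0.6.
def graph_candidates(query_node: str,
                     edges: list[tuple[str, str]]) -> dict[str, float]:
    successors: dict[str, set[str]] = {}
    predecessors: dict[str, set[str]] = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)
        predecessors.setdefault(dst, set()).add(src)

    scores = {query_node: 1.0}                     # direct match
    for node in successors.get(query_node, ()):    # one hop out
        scores.setdefault(node, 0.7)
    for node in predecessors.get(query_node, ()):  # one hop in
        scores.setdefault(node, 0.6)
    return scores

# DEPENDS_ON edges: service -> dependency.
edges = [("auth-service", "PostgreSQL"), ("billing-service", "PostgreSQL")]
scores = graph_candidates("PostgreSQL", edges)
```

For "What services depend on PostgreSQL?", the answer falls out of the predecessor hop: both services surface as 0.6-scored candidates, which the 3-source RRF fusion can then combine with BM25 and vector hits.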
- **6.4** (2026-01-16): **Hindsight Temporal Memory Complete (Phase 4).** Implemented four-network temporal memory for time-based queries. New module: `src/memory/hindsight/` with 4 files. Types: `NetworkType` enum (WORLD, BANK, OPINION, OBSERVATION), `TimeRange` with relative-time support, `TemporalEvent` with validity periods, `TemporalMemory` wrapper, `StateChange` for transitions. Timeline: event storage indexed by entity/network/time, with state reconstruction at any point in time. Temporal search: natural-language temporal query parsing ("last month", "Q4 2025", "before incident-123"), relevance scoring with temporal context. Target queries: "What changed last month?", "What was auth-service status before incident-123?", "Show me decisions made in Q4 2025". Council review: four-network architecture validated, security recommendations noted, `HybridRetriever` integration recommended. GitHub issues: closed #123, #124, #125, #126. Test count: 1030 memory tests (75 new Hindsight tests). TDD approach.
- **6.5** (2026-01-16): **Context Caching Complete (Option G).** Implemented provider-side caching for a 75-90% cost reduction. `RetrievalCache`: `src/memory/retrieval/cache.py` with LRU eviction, configurable TTL (default 3600s), thread-safe operations (`RLock`), per-user invalidation, hit/miss metrics, serialization support. Provider integration: `LocalMemoryProvider` `use_cache` parameter (default: `False`), automatic invalidation on store/delete/clear, `get_cache_metrics()` for monitoring. Council review: unanimous approval; invalidation strategy validated; auth-drift concern noted (recommend invalidating on permission changes); default TTL of 5 minutes recommended. GitHub issues: closed #127, #128. Test count: 1067 memory tests (37 new caching tests). TDD approach.
- **6.6** (2026-01-16): **Production Optimization Complete (Phase D).** Implemented operational monitoring and documentation for production readiness. HNSW scale monitoring: `src/memory/observability/scale_milestones.py` with `ScaleMilestoneTracker` (10k/50k/100k items), `MilestoneCheckResult` for health checks, configurable recall/latency thresholds, callback-based health-check triggers. RRF weight tuning: updated `fusion.py` with a per-source `weights` parameter for tunable multi-source retrieval. Graph query monitoring: `src/memory/observability/graph_metrics.py` with `GraphMetricsCollector` (p50/p95/p99 latency), hop-level latency breakdown (direct/neighbor/predecessor), `QueryMeasurementContext` for easy instrumentation, `GraphSizeSnapshot` for node/edge tracking. Operations runbook: `docs/operations/memory-runbook.md` with architecture overview, key metrics, troubleshooting guides, scaling guidelines, and health-check procedures. GitHub issues: closed #129, #130, #131. Test count: 1133 memory tests (69 new Phase D tests: 36 scale milestones, 6 RRF weights, 27 graph metrics). TDD approach.
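The p50/p95/p99 latency summary mentioned above can be computed with the nearest-rank percentile method; whether `GraphMetricsCollector` uses nearest-rank or interpolation is not stated here, so this is one reasonable sketch, with illustrative names and sample data:

```python
# Nearest-rank percentile sketch for a latency summary: sort the
# samples and take the value at rank ceil(p/100 * n).
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]; samples must be non-empty."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# Hypothetical graph-query latencies in milliseconds, one slow outlier.
latencies_ms = [12.0, 15.0, 11.0, 120.0, 14.0, 13.0, 16.0, 18.0, 17.0, 19.0]
summary = {p: percentile(latencies_ms, p) for p in (50, 95, 99)}
```

The point of tracking p95/p99 alongside p50 is visible even in this toy data: the median hides the 120 ms outlier that the tail percentiles surface.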
- **6.7** (2026-01-16): **MaaS Trust Model Documented (Phase 4.2).** Added a Trust Model section documenting architectural decisions raised by the LLM Council. Trust boundary: MaaS APIs are internal interfaces called by the trusted orchestrator layer. Enforced: capability checks, scope hierarchy, capacity limits (DoS prevention), audit logging, 128-bit ID entropy, defensive copies, handoff lifecycle. Not enforced: authentication (MCP server layer), `owner_id` verification (orchestrator responsibility), pool membership authorization (orchestrator controls exposure). Rationale: follows the same pattern as `ExtensionRegistry` and `LocalMemoryProvider`. Implementation tracker: updated with 7 MaaS completion entries. Test count: 1282 memory tests (149 MaaS tests). Council feedback incorporated.