ADR-003: Project Intent - Persistent Technical Context for AI Development
Status: Accepted
Date: 2025-12-22
Decision Makers: Development Team
Owners: @christopherjoseph
Version: 6.7 (MaaS Trust Model)
Decision Summary
We adopt a Memory-First architecture using MCP, implementing a phased approach that prioritizes foundations and quick wins before architectural complexity:
- Phase 0: Evaluation, governance, and observability foundations
- Phase 1: Pixeltable-native conversational memory for immediate personalization (built in-house rather than integrating Mem0; see Build vs. Integrate below)
- Phase 2: Context Engineering for optimized retrieval
- Phase 3: HybridRAG with Knowledge Graph for complex queries
- Phase 4: Advanced features (Hindsight, MaaS) as needed
We reject simple Vector-only RAG as insufficient for code logic and dependency reasoning.
Critical Architecture Decision: Build vs. Integrate (Mem0)
The Question
Should we integrate Mem0 for conversational memory, or build equivalent capabilities on Pixeltable?
| Capability | Pixeltable | Mem0 |
|---|---|---|
| Vector embeddings | ✓ | ✓ |
| Semantic search | ✓ | ✓ |
| Structured metadata | ✓ | ✓ |
| Graph relationships | Partial | ✓ (Mem0g) |
| User/session scoping | Manual | Built-in |
| Memory extraction | Manual | Automatic |
Council Decision: Build on Pixeltable (Option B)
All 4 models unanimously agreed: Pixeltable must remain the canonical source of truth.
The "Split Brain" Problem
Integrating Mem0 creates two vector databases:
User Query → MCP Server → ???
                          ├→ Pixeltable (code, ADRs, incidents)
                          └→ Mem0 (conversations, preferences)
                               └→ Its own vector DB
Problems this creates:
- Two sources of truth with different consistency models
- Cross-system queries for "What did we decide about auth?" are complex
- Different score scales make result merging difficult
- Debugging spans two systems with different logging
Why Pixeltable Can Replace Mem0
Mem0's primary value (extraction + scoping) can be replicated using Pixeltable's computed columns:
# Memory extraction via LLM UDF (computed column)
@pt.computed_column
def extract_memory_facts(row) -> list[dict]:
    """Extract memorable facts from conversation."""
    prompt = """
    Analyze this conversation and extract:
    1. User preferences (explicit and implicit)
    2. Decisions made
    3. Facts learned about the codebase
    4. Corrections to previous understanding
    Return as structured JSON with confidence scores.
    """
    return llm_extract(prompt, row.content)

# User/session scoping
@pt.computed_column
def memory_scope(row) -> dict:
    return {
        "user_id": row.user_id,
        "project_id": row.project_id,
        "scope_hierarchy": [
            f"user:{row.user_id}",
            f"project:{row.project_id}",
            "global"
        ]
    }

# Memory consolidation (dedup, merge, supersede)
class MemoryConsolidator:
    async def consolidate(self, new_memory: Memory) -> ConsolidationResult:
        similar = await self.find_similar(new_memory, threshold=0.85)
        if not similar:
            return ConsolidationResult(action="insert", memory=new_memory)
        # Handle contradictions, merges, supersedes...
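The dedup branch of `consolidate` can be sketched in a self-contained way. This is a minimal illustration, assuming memories carry precomputed embedding vectors and that "similar" means cosine similarity; the `Memory` fields, the `cosine` helper, and the in-memory list are hypothetical stand-ins, and only the 0.85 threshold comes from the snippet above.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:  # hypothetical shape, for illustration only
    content: str
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def consolidate(store: list[Memory], new: Memory, threshold: float = 0.85) -> str:
    """Insert `new` unless an existing memory is at least `threshold` similar."""
    for existing in store:
        if cosine(existing.embedding, new.embedding) >= threshold:
            return "duplicate"  # a real consolidator would merge or supersede here
    store.append(new)
    return "insert"
```

A production version would query the vector index instead of scanning a list, and would return a richer result than a string.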
Decision Matrix
| Criteria | A) Integrate Mem0 | B) Build on Pixeltable |
|---|---|---|
| Time to MVP | 2-3 weeks | 4-6 weeks |
| Maintenance burden | High (two systems) | Low (one system) |
| Feature velocity | Fast initially, slows | Slower initially, compounds |
| Architectural simplicity | Poor | Excellent |
| Long-term flexibility | Limited by Mem0's roadmap | Full control |
| Debugging | High complexity | Low complexity |
| Data consistency | Two sources of truth | Single source |
Accepted Trade-offs
- Longer initial development (4-6 weeks vs 2-3 weeks)
- Must build extraction/consolidation logic ourselves
- No access to Mem0's ongoing R&D
Validation Criteria (Phase 1 Exit)
- Memory retrieval latency < 200ms p95
- Extraction accuracy > 80% (manual evaluation)
- Zero cross-user memory leakage
Context
The Problem: Ephemeral AI Context
Large Language Models (LLMs) suffer from a fundamental limitation: context window amnesia. Each conversation starts fresh, forcing developers to repeatedly explain:
- Project architecture and design decisions
- Coding conventions and patterns
- Past incidents and their resolutions
- Domain-specific terminology
- Team decisions and their rationale
This creates three significant pain points:
- Repetitive Context Loading: Developers waste time re-establishing context every session
- Lost Institutional Knowledge: Valuable decisions and learnings evaporate between conversations
- Inconsistent Assistance: Without historical context, AI suggestions may contradict past decisions
The Vision: Persistent Technical Memory
Luminescent Cluster aims to give AI assistants persistent technical memory - the ability to recall project context, architectural decisions, incident history, and codebase knowledge across sessions and even across different LLM providers.
Industry Context (December 2025)
The LLM memory landscape has evolved rapidly. Key developments:
- MCP Standardization: The Model Context Protocol is now the de-facto standard, adopted by OpenAI, Google DeepMind, Microsoft, and AWS. In December 2025, Anthropic donated MCP to the Linux Foundation's Agentic AI Foundation (AAIF).
- Beyond RAG: Traditional RAG is being challenged by agentic memory architectures that maintain context over time, track evolving beliefs, and perform temporal reasoning.
- Production Scale: Systems like Mem0 have achieved 26% accuracy improvements with 91% lower latency and 90% token savings at enterprise scale (186M API calls/Q3 2025).
- Context Engineering: A new discipline focused on "the delicate art and science of filling the context window with just the right information" (Andrej Karpathy).
Decision Drivers (Ranked)
| Priority | Driver | Weight | Rationale |
|---|---|---|---|
| 1 | Developer Experience | 30% | Reduce repetition friction; faster onboarding |
| 2 | Accuracy | 25% | Retrieved context must be relevant and correct |
| 3 | Token Efficiency | 20% | Context window is expensive; minimize waste |
| 4 | Implementation Velocity | 15% | Need wins this quarter to validate approach |
| 5 | Future Flexibility | 10% | Don't lock in prematurely; enable pivots |
Non-Goals
This ADR explicitly does not address:
- Multi-tenant implementation details: Multi-tenancy is technically supported via the Extension Registry (ADR-005) and unified in the integration layer (ADR-007). This ADR remains focused strictly on memory architecture; tenant isolation is treated here as an external constraint rather than a core feature
- General documentation RAG: Focus is on technical context, not help docs
- AGI-style continuous learning: Memory is curated, not autonomous
- PII/secrets storage: Sensitive data excluded by policy
- Training custom models: Memory augments existing LLMs, doesn't train new ones
Success Metrics (Quantified)
| Metric | Baseline | Target | Measurement |
|---|---|---|---|
| Context re-explanation frequency | ~5/session | <1/session | User survey |
| Context retrieval latency | N/A | <500ms p95 | Instrumentation |
| Retrieval precision@5 | N/A | >85% | LongMemEval subset |
| Token efficiency | N/A | <30% of window for memory | Context analysis |
| Multi-hop query accuracy | N/A | >75% | Internal benchmark |
| Developer satisfaction (NPS) | N/A | >50 | Quarterly survey |
Decision
We implement a three-tier memory architecture exposed via the Model Context Protocol (MCP):
Tier 1: Session Memory (Hot Context)
Purpose: Fast access to current development state
Implementation: session_memory_server.py
Data Sources:
- Git repository state (current branch, status)
- Recent commits (last 200)
- Current diff (staged/unstaged changes)
- File change history
- Active task context (set by user/agent)
Characteristics:
- Latency: <10ms (in-memory)
- Scope: Current repository
- Persistence: Ephemeral (session-bound)
- Cost: Zero (local computation)
Tier 2: Long-term Memory (Persistent Knowledge)
Purpose: Semantic search over organizational knowledge
Implementation: pixeltable_mcp_server.py + pixeltable_setup.py
Data Sources:
- Code repositories (multi-service)
- Architectural Decision Records (ADRs)
- Production incident history
- Meeting transcripts
- Documentation
Characteristics:
- Latency: 100-500ms (semantic search)
- Scope: Entire organization, cross-project
- Persistence: Durable (survives restarts)
- Cost: Embedding generation (local sentence-transformers by default)
Tier 3: Intelligent Orchestration
Purpose: Efficient multi-tool coordination
Mechanisms:
- Tool Search: On-demand tool discovery (85% token reduction)
- Programmatic Tool Calling: Batch operations in sandbox (37% token reduction)
- Deferred Loading: Heavy tools loaded only when needed
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ AI Assistant (Claude, etc.) │
├─────────────────────────────────────────────────────────────────┤
│ Model Context Protocol (MCP) │
├────────────────┬────────────────────────┬───────────────────────┤
│ Tier 1 │ Tier 2 │ Tier 3 │
│ Session Memory │ Long-term Memory │ Orchestration │
├────────────────┼────────────────────────┼───────────────────────┤
│ • Git state │ • Pixeltable DB │ • Tool Search │
│ • Recent commits│ • Semantic embeddings │ • Programmatic Calls │
│ • Current diff │ • Multi-project index │ • Deferred Loading │
│ • Task context │ • ADRs, incidents │ │
└────────────────┴────────────────────────┴───────────────────────┘
Interface Contract (MCP)
Implementation Note: These interface definitions represent the conceptual architectural vision. For live, versioned protocol signatures (including ContextStore, TenantProvider, and AuditLogger), refer to src/extensions/protocols.py as consolidated in ADR-007. Chatbot-specific implementations of these interfaces are provided by the adapters defined in ADR-006.
The memory system exposes the following MCP resources and tools:
Resources (Read-Only Context)
project://memory/recent_decisions # Last N architectural decisions
project://memory/active_incidents # Open incidents affecting this service
project://memory/conventions # Coding patterns and team standards
project://memory/dependency_graph # Service relationships (Phase 3+)
Tools (Actions)
# Session Memory Tools
set_task_context(task: str, details: dict) # Set current work context
get_task_context() -> TaskContext # Retrieve current context
search_commits(query: str) -> list[Commit] # Search commit history
# Long-term Memory Tools
semantic_search(query: str, limit: int) -> list[Result]
ingest_code(path: str, service: str) # Index codebase
ingest_adr(path: str, service: str) # Index decision record
ingest_incident(summary: str, severity: str, lessons: str)
# Memory Management (Phase 0+)
update_memory(key: str, value: Any, source: str) # With provenance
invalidate_memory(key: str, reason: str) # Explicit expiration
get_memory_provenance(key: str) -> Provenance # Audit trail
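To make the Memory Management contract concrete, here is a minimal in-memory sketch of the three calls. It is an assumption-laden illustration, not the project's implementation: `MemoryStore` and `ProvenanceEvent` are hypothetical names, and provenance is modeled as a simple list of events rather than the `Provenance` type named in the signature above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class ProvenanceEvent:  # hypothetical record shape
    action: str   # "update" or "invalidate"
    detail: str   # source for updates, reason for invalidations
    at: datetime


@dataclass
class MemoryStore:
    values: dict[str, Any] = field(default_factory=dict)
    history: dict[str, list[ProvenanceEvent]] = field(default_factory=dict)

    def _log(self, key: str, action: str, detail: str) -> None:
        event = ProvenanceEvent(action, detail, datetime.now(timezone.utc))
        self.history.setdefault(key, []).append(event)

    def update_memory(self, key: str, value: Any, source: str) -> None:
        self.values[key] = value
        self._log(key, "update", source)

    def invalidate_memory(self, key: str, reason: str) -> None:
        self.values.pop(key, None)  # the value is dropped, the audit trail is kept
        self._log(key, "invalidate", reason)

    def get_memory_provenance(self, key: str) -> list[ProvenanceEvent]:
        return list(self.history.get(key, []))
```

Keeping the history after invalidation is the design choice that lets the audit trail answer why a memory disappeared.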
Key Design Principles
1. LLM-Agnostic via MCP
The system uses the Model Context Protocol (MCP), making it portable across:
- Claude (via Claude Code)
- OpenAI ChatGPT (MCP support added March 2025)
- Google Gemini (MCP support announced April 2025)
- Custom agents via programmatic access
2. Semantic Search over Keyword Matching
Long-term memory uses sentence-transformer embeddings for semantic similarity:
# Find conceptually related code, not just keyword matches
"authentication flow" → finds OAuth, JWT, session handling code
3. Multi-Project Awareness
Knowledge is indexed by service_name, enabling:
- Cross-project searches ("How does auth-service handle tokens?")
- Project-specific filtering ("Show incidents for payment-api only")
- Organizational patterns ("What database patterns do we use?")
4. Automatic Embedding Maintenance
Pixeltable's computed columns automatically recompute embeddings when content changes:
# Embeddings stay in sync - no manual refresh needed
embedding=kb.content.apply(embed_text)
5. Defense in Depth (Python Version)
Per ADR-001, the system includes 7 layers of protection against Python version mismatch issues that could corrupt the Pixeltable database.
Use Cases
1. Architectural Continuity
User: "Why did we choose PostgreSQL over MongoDB?"
AI: [Queries ADRs] "ADR-005 documents this decision from March 2024..."
2. Incident-Aware Development
User: "I'm adding rate limiting to the auth service"
AI: [Queries incidents] "Note: We had an outage in November due to rate limiter
misconfiguration. The post-mortem recommended..."
3. Cross-Session Context
Session 1: User sets task context "Implementing OAuth2 PKCE flow"
Session 2: AI recalls task context and relevant ADRs automatically
4. Codebase Navigation
User: "How do we handle database connections?"
AI: [Semantic search] "Based on auth-service/db/pool.py and the connection
pooling ADR, you use..."
Rationale
Why Not Just Use RAG?
Traditional RAG (Retrieval Augmented Generation) typically:
- Requires manual chunk management
- Needs explicit embedding refresh
- Lacks structured metadata (service, type, severity)
- Doesn't integrate with development workflow
Luminescent Cluster adds:
- Computed columns: Auto-updating embeddings
- Typed knowledge: Code vs ADR vs incident with different schemas
- Git integration: Session memory tied to repository state
- MCP exposure: Native tool integration with AI assistants
Why MCP over Custom APIs?
- Standardized: Works with any MCP-compatible client
- Discoverable: Tools are self-documenting
- Composable: Clients can orchestrate multiple MCP servers
- Future-proof: Now backed by Linux Foundation with industry-wide adoption
Why Pixeltable?
- Computed columns: Embeddings auto-update on content change
- Multimodal ready: Can extend to images, videos
- Snapshot/restore: Built-in versioning for knowledge base
- Python-native: Fits development workflow
Improvement Options (December 2025 Research)
Based on current industry developments, the following enhancement options are presented for council review:
Option A: HybridRAG - Knowledge Graph Integration
What: Combine vector embeddings with a knowledge graph for multi-hop reasoning.
Industry Evidence:
- Microsoft Research shows 2.8x accuracy improvement with hybrid approaches
- GraphRAG (Microsoft) constructs knowledge graphs from unstructured data
- Cedars-Sinai's AlzKB demonstrates real-world HybridRAG success
Implementation:
┌─────────────────────────────────────────────────────────────────┐
│ HybridRAG Architecture │
├────────────────────────┬────────────────────────────────────────┤
│ Vector Search │ Graph Traversal │
│ (Pixeltable) │ (Neo4j/Memgraph) │
├────────────────────────┼────────────────────────────────────────┤
│ • Semantic similarity │ • Entity relationships │
│ • Fuzzy matching │ • Multi-hop reasoning │
│ • Fast retrieval │ • Causal chains │
└────────────────────────┴────────────────────────────────────────┘
│
▼
Reciprocal Rank Fusion
│
▼
Neural Reranker
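Reciprocal Rank Fusion merges the two result lists using ranks only, which sidesteps the incompatible score scales of vector and graph retrieval. A minimal sketch follows; the document IDs are made up for illustration, and `k=60` is a commonly used default rather than a value taken from this ADR.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists; each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical hits from the two retrievers.
vector_hits = ["auth/jwt.py", "adr-012", "incident-88"]
graph_hits = ["incident-88", "auth/jwt.py", "payments/api.py"]
fused = reciprocal_rank_fusion([vector_hits, graph_hits])
# fused == ["auth/jwt.py", "incident-88", "adr-012", "payments/api.py"]
```

Items found by both retrievers rise to the top, and the fused list is what would be handed to the neural reranker.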
Benefits:
- Multi-hop reasoning: "What services depend on the auth module that had the incident?"
- Explicit relationships: Code → ADR → Incident → Resolution chains
- Better temporal reasoning: Track how decisions evolved
Tradeoffs:
- Additional infrastructure (graph database)
- Data modeling complexity
- Sync between vector and graph stores
Effort: Medium-High
Impact: High for complex organizational queries
Option B: Mem0 Integration
What: Integrate Mem0's production-proven memory layer alongside Pixeltable.
Industry Evidence:
- 26% accuracy improvement, 91% lower p95 latency, 90% token savings
- 41K GitHub stars, 14M downloads, 186M API calls/quarter
- AWS chose Mem0 as exclusive memory provider for Agent SDK
- SOC 2 & HIPAA compliant
Architecture:
┌─────────────────────────────────────────────────────────────────┐
│ Memory Layer │
├────────────────────────┬────────────────────────────────────────┤
│ Pixeltable │ Mem0 │
│ (Long-term KB) │ (Conversational Memory) │
├────────────────────────┼────────────────────────────────────────┤
│ • Codebase index │ • User preferences │
│ • ADRs & incidents │ • Conversation facts │
│ • Static knowledge │ • Dynamic learning │
│ • Service-scoped │ • User/session-scoped │
└────────────────────────┴────────────────────────────────────────┘
Benefits:
- Production-proven scale and performance
- Graph-based memory variant (Mem0g) for relational knowledge
- Three-line integration
- Enterprise compliance built-in
Tradeoffs:
- External dependency (cloud or self-hosted)
- Potential overlap with Pixeltable functionality
- Additional cost for hosted version
Effort: Low-Medium
Impact: High for personalization and learning
Option C: Hindsight Agentic Memory
What: Adopt the four-network memory architecture that achieved 91.4% on LongMemEval.
Industry Evidence:
- Highest accuracy on LongMemEval benchmark (December 2025)
- Designed specifically for agents needing temporal/causal reasoning
- TEMPR retrieval: semantic + keyword + graph + temporal filtering
Four Network Architecture:
┌──────────────────────────────────────────────────────────────────┐
│ Hindsight Memory Networks │
├─────────────────┬─────────────────┬─────────────────┬────────────┤
│ World │ Bank │ Opinion │ Observation│
│ Network │ Network │ Network │ Network │
├─────────────────┼─────────────────┼─────────────────┼────────────┤
│ External facts │ Agent's own │ Subjective │ Neutral │
│ about the world │ experiences & │ judgments with │ entity │
│ │ actions │ confidence │ summaries │
└─────────────────┴─────────────────┴─────────────────┴────────────┘
Mapped to Luminescent Cluster:
- World Network: Codebase structure, API contracts, dependencies
- Bank Network: What the AI has done, commits made, PRs reviewed
- Opinion Network: Code quality assessments, architecture preferences
- Observation Network: Service summaries, team conventions
Benefits:
- Separates facts from opinions (prevents hallucination bleed)
- Tracks agent's own actions (audit trail)
- Temporal filtering for time-sensitive queries
- Open source (Apache 2.0)
Tradeoffs:
- Significant architectural change
- New data model required
- Less mature than Mem0 (newer project)
Effort: High
Impact: Very High for long-horizon agent tasks
Option D: Context Engineering Enhancements
What: Implement advanced context management strategies.
Industry Evidence:
- ACE framework: +10.6% on agents, +8.6% on finance benchmarks
- Google ADK's Context Compaction reduces latency/tokens
- Anthropic found isolated contexts outperform single-agent approaches
Strategies:
- Memory Blocks: Structure context into discrete functional units
┌─────────────────────────────────────────────────────────────────┐
│ Memory Block Layout │
├─────────────────────────────────────────────────────────────────┤
│ [System Block] │ Core instructions, persona │
│ [Project Block] │ Current project context, conventions │
│ [Task Block] │ Active task, goals, constraints │
│ [History Block] │ Compressed conversation history │
│ [Knowledge Block] │ Retrieved ADRs, incidents, code │
└─────────────────────────────────────────────────────────────────┘
- Context Compaction: Auto-summarize when threshold reached
- Selective Retrieval: Pull only relevant knowledge per query
- Context Isolation: Split complex tasks across subagents
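Memory Blocks and Selective Retrieval can be combined into a simple budgeted assembly step. The sketch below is an illustration under stated assumptions: `assemble_context` is a hypothetical helper, blocks arrive already ordered by priority, and tokens are approximated by whitespace-separated words (a real implementation would use the target model's tokenizer).

```python
def assemble_context(blocks: list[tuple[str, str]], budget_tokens: int) -> str:
    """Concatenate (name, text) blocks in priority order within a token budget."""
    parts, used = [], 0
    for name, text in blocks:
        cost = len(text.split())  # crude token estimate
        if used + cost > budget_tokens:
            continue  # skip a block that does not fit; a later, smaller one may
        parts.append(f"[{name} Block]\n{text}")
        used += cost
    return "\n\n".join(parts)


blocks = [
    ("System", "be concise"),
    ("Project", "python service using postgres"),
    ("History", "word " * 50),
    ("Task", "add rate limiting"),
]
context = assemble_context(blocks, budget_tokens=10)
```

Here the oversized History block is dropped; Context Compaction would instead summarize it until it fits.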
Benefits:
- Reduces "context rot" from oversized windows
- Clear separation of concerns
- Enables monitoring and debugging
- Works with existing infrastructure
Tradeoffs:
- Requires client-side coordination
- MCP servers may not have visibility into full context
- Summarization can lose nuance
Effort: Medium
Impact: Medium-High for efficiency
Option E: Memory as a Service (MaaS)
What: Shift from agent-bound memory to shared memory services for multi-agent collaboration.
Industry Evidence:
- Research shows agent memory silos hinder collaboration
- MaaS paradigm emerging for multi-agent systems
- Enables organizational memory shared across tools/agents
Architecture:
┌─────────────────────────────────────────────────────────────────┐
│ Memory as a Service (MaaS) │
├─────────────────────────────────────────────────────────────────┤
│ Shared Memory Layer │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ Code KB │ │ Decision │ │ Incident │ │
│ │ Service │ │ Service │ │ Service │ │
│ └─────┬─────┘ └─────┬─────┘ └─────┬─────┘ │
│ │ │ │ │
├─────────┼──────────────┼──────────────┼──────────────────────────┤
│ ▼ ▼ ▼ │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ Claude │ │ GPT Agent │ │ Custom │ │
│ │ Code │ │ │ │ Pipeline │ │
│ └───────────┘ └───────────┘ └───────────┘ │
└─────────────────────────────────────────────────────────────────┘
Benefits:
- Memory persists regardless of which agent/LLM is used
- Enables handoff between specialized agents
- Organizational knowledge accessible to all tools
- Natural fit for MCP's server architecture
Tradeoffs:
- Access control complexity
- Consistency across agents
- Security considerations (MEXTRA attack vulnerabilities)
Effort: Medium
Impact: High for multi-agent workflows
Option F: LangMem Integration
What: Integrate LangChain's LangMem SDK for procedural, episodic, and semantic memory types.
Industry Evidence:
- Native integration with LangGraph (popular agent framework)
- DeepLearning.AI course validates approach
- MongoDB, Redis integrations available
- Part of LangChain's production ecosystem
Memory Types:
┌─────────────────────────────────────────────────────────────────┐
│ LangMem Memory Types │
├─────────────────┬─────────────────┬─────────────────────────────┤
│ Procedural │ Episodic │ Semantic │
├─────────────────┼─────────────────┼─────────────────────────────┤
│ How to do tasks │ Specific events │ General facts │
│ (coding styles, │ (this PR, that │ (architecture, │
│ conventions) │ incident) │ team knowledge) │
└─────────────────┴─────────────────┴─────────────────────────────┘
Benefits:
- Well-documented, production-ready
- Multiple persistence backends
- Checkpointing and time-travel
- Large community support
Tradeoffs:
- LangChain dependency
- May not integrate directly with MCP
- Overlaps with Pixeltable functionality
Effort: Medium
Impact: Medium for LangGraph users
Options Comparison Matrix
Weighted Decision Matrix
| Criterion (Weight) | A: HybridRAG | B: Mem0 | C: Hindsight | D: Context Eng | E: MaaS | F: LangMem |
|---|---|---|---|---|---|---|
| Accuracy (25%) | ★★★★★ | ★★★☆☆ | ★★★★★ | ★★☆☆☆ | ★★★☆☆ | ★★★☆☆ |
| Dev Experience (30%) | ★★★☆☆ | ★★★★★ | ★★★☆☆ | ★★★★☆ | ★★★★☆ | ★★★★☆ |
| Effort/Risk (20%) | ★☆☆☆☆ | ★★★★☆ | ★☆☆☆☆ | ★★★★★ | ★★★☆☆ | ★★★★☆ |
| Maturity (15%) | ★★★☆☆ | ★★★★☆ | ★★☆☆☆ | ★★★★★ | ★★★☆☆ | ★★★★☆ |
| Flexibility (10%) | ★★★★☆ | ★★★☆☆ | ★★★☆☆ | ★★★★★ | ★★★★☆ | ★★★★☆ |
| Weighted Score | 2.95 | 3.70 | 2.80 | 3.80 | 3.35 | 3.60 |
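A weighted score is the sum of each rating multiplied by its weight. The sketch below recomputes it from the whole-star ratings and percentage weights in the matrix. Whole stars give values that differ from the Weighted Score row (for example 3.20 rather than 2.95 for Option A), so that row was presumably derived from finer-grained ratings that the stars round; treat this as a cross-check, not a correction.

```python
# Weights (percent) and whole-star ratings copied from the matrix above.
WEIGHTS = {"accuracy": 25, "dev_experience": 30, "effort_risk": 20,
           "maturity": 15, "flexibility": 10}
CRITERIA = list(WEIGHTS)

STARS = {
    "A: HybridRAG":   [5, 3, 1, 3, 4],
    "B: Mem0":        [3, 5, 4, 4, 3],
    "C: Hindsight":   [5, 3, 1, 2, 3],
    "D: Context Eng": [2, 4, 5, 5, 5],
    "E: MaaS":        [3, 4, 3, 3, 4],
    "F: LangMem":     [3, 4, 4, 4, 4],
}


def weighted_score(stars: list[int]) -> float:
    """Sum of rating x weight, with weights expressed in percent."""
    return sum(WEIGHTS[c] * s for c, s in zip(CRITERIA, stars)) / 100


scores = {option: weighted_score(stars) for option, stars in STARS.items()}
# A: 3.2, B: 3.95, C: 2.95, D: 3.95, E: 3.4, F: 3.75
```

Under whole stars, B and D tie at the top and the remaining order (F, E, A, C) matches the table.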
Quantitative Metrics (Hypotheses)
| Option | Accuracy Gain | Latency Impact | Effort | Infrastructure | Best For |
|---|---|---|---|---|---|
| A. HybridRAG | +180% (2.8x)* | +50ms | High | Graph DB | Multi-hop code queries |
| B. Mem0 | +26% | -91% p95 | Low-Med | Cloud/Self-host | User personalization |
| C. Hindsight | +40%** | Neutral | High | New data model | Long-horizon agents |
| D. Context Eng | +10-30% | -30% | Medium | None | Immediate optimization |
| E. MaaS | N/A | Neutral | Medium | API layer | Multi-agent workflows |
| F. LangMem | Moderate | Neutral | Medium | LangChain | LangGraph integration |
*Based on Microsoft GraphRAG research (2024); requires validation on code domains
**Estimated from LongMemEval 91.4% vs ~65% baseline RAG
Additional Industry Developments (Council Additions)
Based on council feedback, the following December 2025 developments were identified as missing:
Option G: Context Caching (Provider-Side)
What: Leverage provider-side context caching (Anthropic, OpenAI, Google) to avoid re-sending stable project context.
Industry Evidence:
- By late 2025, all major providers offer context caching
- Cost reduction: 75-90% for repeated project context
- Latency reduction: Eliminate round-trip for cached prefixes
Implementation:
┌─────────────────────────────────────────────────────────────────┐
│ Context Caching Flow │
├─────────────────────────────────────────────────────────────────┤
│ [Stable Context] │ [Dynamic Context] │
│ • Project architecture │ • Current task │
│ • ADRs & conventions │ • Recent code changes │
│ • Team patterns │ • User query │
│ ↓ │ ↓ │
│ CACHED (TTL: 1 hour) │ SENT FRESH │
└─────────────────────────────────────────────────────────────────┘
Effort: Low
Impact: High for cost/latency
Option H: Episodic Rollback ("Git for Memory")
What: Version control for memory state, enabling rollback when agents go wrong.
Industry Evidence:
- Developer expectation from git workflows
- Critical for debugging agent mistakes
- Enables "what if" exploration
Implementation:
# Memory checkpoint API
checkpoint = memory.create_checkpoint("before-refactor")
# ... agent makes decisions ...
memory.rollback_to(checkpoint) # Undo bad decisions
memory.diff(checkpoint, "HEAD") # Compare states
Effort: Medium
Impact: High for agent reliability
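A minimal in-memory sketch of the checkpoint API, assuming memory state is a flat key/value mapping. `VersionedMemory` is a hypothetical class, and unlike the snippet above its `diff` always compares a checkpoint against the current state rather than accepting a "HEAD" argument.

```python
import copy


class VersionedMemory:
    """In-memory key/value store with named checkpoints (illustrative only)."""

    def __init__(self) -> None:
        self.state: dict[str, str] = {}
        self._checkpoints: dict[str, dict[str, str]] = {}

    def create_checkpoint(self, name: str) -> str:
        self._checkpoints[name] = copy.deepcopy(self.state)
        return name

    def rollback_to(self, name: str) -> None:
        self.state = copy.deepcopy(self._checkpoints[name])

    def diff(self, name: str) -> dict[str, tuple]:
        """Map each changed key to (value at checkpoint, current value)."""
        old = self._checkpoints[name]
        keys = old.keys() | self.state.keys()
        return {k: (old.get(k), self.state.get(k))
                for k in keys if old.get(k) != self.state.get(k)}
```

Full snapshots are the simplest approach; a durable store would more likely record deltas or rely on the database's own versioning.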
Option I: Provenance & Governance
What: Every memory item carries source links, timestamps, confidence, and validity scope.
Industry Evidence:
- Enterprise requirement for audit trails
- Prevents hallucination from unverified sources
- Enables "why does the AI think this?" debugging
Schema:
from dataclasses import dataclass
from datetime import datetime

@dataclass
class MemoryItem:
    content: str
    source: str           # "PR #402", "ADR-005", "user:chris"
    created_at: datetime
    expires_at: datetime  # TTL-based expiration
    confidence: float     # 0.0-1.0
    verified_by: str      # Human approval if required
Effort: Medium
Impact: High for enterprise trust
Option J: Retrieval-Augmented Thoughts (RAT)
What: Interleave retrieval during chain-of-thought, not just before generation.
Industry Evidence:
- Research shows improved reasoning with mid-thought retrieval
- Addresses "lost in the middle" problem
- More accurate for complex technical queries
Flow:
Traditional RAG: [Retrieve] → [Think] → [Respond]
RAT: [Think] → [Retrieve] → [Think more] → [Retrieve] → [Respond]
Effort: High (requires reasoning model integration)
Impact: High for complex queries
Option K: Temporal Memory Decay (Forgetting Curve)
What: Implement forgetting mechanisms so old/unused memories decay in relevance.
Industry Evidence:
- "Relevance pollution" is a real problem at scale
- Old decisions may conflict with new ones
- Memory systems need pruning strategies
Implementation:
# Decay function
relevance = base_relevance * exp(-λ * days_since_access)
# Re-access refreshes relevance
# Low-relevance items deprioritized in retrieval
Effort: Low-Medium
Impact: Medium for long-term maintenance
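The decay function above becomes easier to tune when λ is expressed as a half-life. A minimal sketch, assuming a 30-day half-life (an illustrative value, not one chosen by this ADR):

```python
import math


def decayed_relevance(base_relevance: float, days_since_access: float,
                      half_life_days: float = 30.0) -> float:
    """Exponential decay: relevance halves every `half_life_days`."""
    decay_rate = math.log(2) / half_life_days  # this is λ in the formula above
    return base_relevance * math.exp(-decay_rate * days_since_access)
```

Re-access is modeled by resetting `days_since_access` to zero, which restores the full base relevance.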
Recommended Roadmap (Council Revised)
Critical Change: The council unanimously agreed the original roadmap was "Optimization before Foundation." The revised roadmap prioritizes quick wins and foundations before heavy architectural lifts.
Phase 0: Foundations (Required First)
Purpose: Cannot optimize what you cannot measure
Deliverables:
- Evaluation Harness
  - Fixed task set + automated scoring
  - Retrieval quality metrics (precision@k, citation correctness)
  - Latency/cost instrumentation
  - Contradiction/hallucination tests
- HNSW Recall Health Monitoring (Council Addition - January 2026)

  Critical Risk: HNSW approximate search silently degrades as the database grows. No errors are raised; the system appears healthy while retrieval quality deteriorates.

  | Requirement | Target | Notes |
  |---|---|---|
  | Recall@k Metric | Measured against brute-force exact search | Golden query set (50 queries) |
  | Absolute Threshold | Recall@10 ≥ 0.90 | Alert if below |
  | Relative Drift | ≤ 5% drop from baseline | Alert on regression |
  | Filtered Search | Evaluate with tenant/tag filters | Prevents "filter-induced recall collapse" |
  | Reindex Trigger | Auto-VACUUM when Recall < threshold | Automated maintenance |

  Retuning Milestones:
  - 10k items: Benchmark and log
  - 50k items: Benchmark, alert, consider ef_search tuning
  - 100k items: Mandatory review, consider index rebuild

  Embedding Versioning (Council Required):
  - Version-tag all embeddings with model identifier
  - On model change: flag for re-embedding
  - Index_v1 cannot serve traffic for Model_v2
- Memory Schema & Lifecycle
  - Define memory types (decisions, facts, procedures, preferences)
  - TTL and expiration policies
  - Versioning strategy (Option H foundation)
  - "Source of truth" policy
- Governance & Observability
  - Trace: retrieval → context assembly → model output
  - Log which memories were used and why
  - Audit trail for memory changes
  - Access control framework
Exit Criteria: Baseline metrics established, governance policies documented
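The HNSW recall check in Phase 0 reduces to comparing approximate results with exact brute-force results per golden query, then applying the two thresholds. A minimal sketch; the function names are hypothetical, and "≤ 5% drop" is interpreted here as a drop relative to the baseline, which is an assumption.

```python
def recall_at_k(approx_ids: list[str], exact_ids: list[str], k: int = 10) -> float:
    """Fraction of the exact top-k that the approximate search also returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 1.0
    return len(truth & set(approx_ids[:k])) / len(truth)


def recall_alerts(mean_recall: float, baseline: float,
                  min_recall: float = 0.90, max_drop: float = 0.05) -> list[str]:
    """Return the alerts triggered by the absolute and relative thresholds."""
    alerts = []
    if mean_recall < min_recall:
        alerts.append("absolute")
    if baseline > 0 and (baseline - mean_recall) / baseline > max_drop:
        alerts.append("drift")
    return alerts
```

In practice `mean_recall` would be the average of `recall_at_k` over the golden query set, computed once unfiltered and once with tenant/tag filters applied.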
Phase 1: Conversational Memory (Pixeltable Native) - 8 WEEKS
Decision: Build on Pixeltable (not Mem0 integration - see Architecture Decision above)
Purpose: Unified memory store with extraction capabilities
Critical Architecture: The "Janitor" Pattern (Council Mandated)
Running LLM extraction synchronously kills latency. We adopt a tiered approach:
┌─────────────────────────────────────────────────────────────────────────────┐
│ HOT / WARM / COLD MEMORY ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ TIER 1: HOT MEMORY (Real-time) │
│ ───────────────────────────── │
│ • Raw chat history retrieval │
│ • No extraction cost │
│ • Latency: <50ms │
│ │
│ TIER 2: WARM MEMORY (Async Extraction) │
│ ────────────────────────────────────── │
│ • Pixeltable computed columns with SMALL model (Llama-3-8B, Haiku) │
│ • Extraction runs AFTER response sent to user │
│ • Latency: Background (no user impact) │
│ │
│ TIER 3: COLD MEMORY (Scheduled Consolidation) │
│ ───────────────────────────────────────────── │
│ • "Janitor Process" - nightly batch job │
│ • Uses REASONING model (GPT-4o, Opus) for complex dedup │
│ • Merges facts, resolves contradictions, expires old data │
│ • Latency: Nightly (no user impact) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
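The key property of the warm tier is that extraction is scheduled, not awaited, on the response path. A minimal asyncio sketch of that ordering; `extract_facts` is a placeholder for the small-model call, and the `extracted` list stands in for the warm-memory table (both hypothetical).

```python
import asyncio

extracted: list[str] = []  # stand-in for the warm-memory table


async def extract_facts(message: str) -> None:
    """Placeholder for the small-model extraction call."""
    await asyncio.sleep(0.01)  # simulates model latency
    extracted.append(f"fact from: {message}")


async def handle_message(message: str, background: set) -> str:
    reply = f"echo: {message}"  # hot path: no extraction cost
    task = asyncio.create_task(extract_facts(message))
    background.add(task)  # hold a reference so the task is not garbage-collected
    task.add_done_callback(background.discard)
    return reply


async def main() -> tuple[str, int, int]:
    background: set = set()
    reply = await handle_message("we use PostgreSQL", background)
    facts_at_reply = len(extracted)    # extraction has not run yet
    await asyncio.gather(*background)  # e.g. drain pending work on shutdown
    return reply, facts_at_reply, len(extracted)


result = asyncio.run(main())
# result == ("echo: we use PostgreSQL", 0, 1)
```

A production system would more likely hand extraction to a queue or worker so that pending work survives a process restart.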
Deliverables (Extended to 8 Weeks per Council):
- Phase 1a: Storage & Hot Memory (Weeks 1-2)
  - Add user_memory table to Pixeltable schema
  - Add conversation_memory table with TTL support
  - Implement user/project scoping
  - Basic CRUD through MCP tools
  - Raw retrieval (Hot Memory tier)
- Phase 1b: Async Extraction (Weeks 3-4)
  - Create extract_memory_facts() UDF using small model (Haiku/Llama-3-8B)
  - Async execution: Extract AFTER response sent
  - Confidence scoring and thresholds
  - Extraction determinism: temperature=0
  - Store raw source alongside extracted facts for re-processing
- Phase 1c: Retrieval & Ranking (Weeks 5-6)
  - Ranking logic: Hot facts vs Extracted facts
  - Query rewriting for memory search
  - Scope-aware retrieval (user vs project vs global)
- Phase 1d: Janitor Process (Weeks 7-8)
  - Nightly consolidation job using reasoning model (GPT-4o)
  - Basic deduplication (>85% similarity threshold)
  - Simple contradiction handling: newer wins, flag for review
  - Memory decay: reduce relevance of unaccessed items
  - Defer complex contradiction resolution to Phase 2
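Scope-aware retrieval, and the zero cross-user leakage requirement, come down to filtering on an allow-list of scopes before ranking. A minimal sketch, assuming each memory carries a single scope string in the `user:<id>` / `project:<id>` / `global` format used by `scope_hierarchy` earlier in this ADR; `ScopedMemory` and `retrieve` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ScopedMemory:  # hypothetical row shape
    content: str
    scope: str  # "user:<id>", "project:<id>", or "global"


def visible_scopes(user_id: str, project_id: str) -> list[str]:
    """Scopes the caller may read, most specific first."""
    return [f"user:{user_id}", f"project:{project_id}", "global"]


def retrieve(memories: list[ScopedMemory], user_id: str,
             project_id: str) -> list[ScopedMemory]:
    """Return only memories in the caller's scopes, most specific first."""
    order = {s: i for i, s in enumerate(visible_scopes(user_id, project_id))}
    allowed = [m for m in memories if m.scope in order]
    return sorted(allowed, key=lambda m: order[m.scope])
```

Filtering by an explicit allow-list, rather than excluding known-bad scopes, means an unrecognized scope is hidden by default.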
Schema Definition:
# Pixeltable tables for conversational memory
user_memory = pt.Table(
"user_memory",
columns={
"user_id": pt.String,
"content": pt.String,
"memory_type": pt.String, # preference, fact, decision
"confidence": pt.Float,
"source": pt.String, # conversation_id, manual
"raw_source": pt.String, # Original text for re-extraction
"extraction_version": pt.Int, # For re-processing on prompt updates
"created_at": pt.Timestamp,
"last_accessed_at": pt.Timestamp, # For decay scoring
"expires_at": pt.Timestamp, # TTL support
},
computed_columns={
"embedding": embed_text(content),
"scope": memory_scope(user_id, project_id),
}
)
Validation Metrics (Council Required):
| Metric | Target | Measurement |
|---|---|---|
| Hot memory latency | <50ms p95 | Instrumentation |
| Extraction precision | >85% | Golden Dataset eval |
| Retrieval relevance | >90% rated "helpful" | Blind evaluation |
| Storage efficiency | <10% duplicates post-consolidation | Automated check |
| Query latency | <200ms p95 | Instrumentation |
Golden Dataset (Council Required): Create 50 static questions representing real project queries:
- "What database do we use?"
- "Why did we reject MongoDB?"
- "What's our logging convention?"
- "Who approved the Kafka decision?"
Run regression tests against Golden Dataset on every PR.
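A per-PR regression check over the Golden Dataset could look like the sketch below. The structure of tests/memory/golden_dataset.json is not described in this ADR, so the case format, the keyword-matching rule, and the function names are assumptions; the expected keywords are deliberately fake placeholders, not real project answers.

```python
from typing import Callable

# Questions come from the ADR; the expected keywords are invented placeholders.
GOLDEN_CASES = [
    {"question": "What database do we use?", "expected_keywords": ["example-db"]},
    {"question": "Why did we reject MongoDB?", "expected_keywords": ["example-reason"]},
]


def evaluate_golden(
    cases: list[dict],
    answer_fn: Callable[[str], str],
) -> tuple[float, list[str]]:
    """Return (accuracy, failed questions) for a memory-backed answer function."""
    failures = []
    for case in cases:
        answer = answer_fn(case["question"]).lower()
        if not all(kw.lower() in answer for kw in case["expected_keywords"]):
            failures.append(case["question"])
    accuracy = 1.0 - len(failures) / len(cases)
    return accuracy, failures


def passes_gate(accuracy: float, threshold: float = 0.85) -> bool:
    # Mirrors the >85% target used for extraction precision.
    return accuracy > threshold
```

In CI, `answer_fn` would wrap the real retrieval path, and the job would fail when `passes_gate` returns False, printing the failed questions for triage.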
Exit Criteria:
- >85% accuracy on Golden Dataset
- Zero cross-user memory leakage
- Janitor process completes in <10 minutes for 10k memories
Decision Reversal Trigger: If Week 4 checkpoint shows extraction precision <70%, evaluate:
- Scoping down consolidation features
- Revisiting Mem0 for extraction-only (not storage)
Phase 2: Context Engineering Optimization
Options: D (Context Engineering) + I (Provenance) + K (Temporal Decay)
Purpose: Optimize retrieval and context assembly
Deliverables:
-
Memory Blocks Architecture
[System Block] │ Core instructions, persona
[Project Block] │ Current project context, conventions
[Task Block] │ Active task, goals, constraints
[History Block] │ Compressed conversation history
[Knowledge Block] │ Retrieved ADRs, incidents, code
-
Retrieval Improvements
- Query rewriting for better matches
- Reranking layer for relevance
- Selective injection based on query type
- Contextual compression for long memories
-
Provenance Tracking
- Source links for all memories
- Confidence scoring
- Temporal decay implementation
-
Grounded Memory Ingestion (Council Addition - January 2026)
Risk: The /memorize command can pollute long-term memory with hallucinations or unsourced claims. "Garbage in, garbage forever."
Tiered Provenance Model (Council Recommended over NLI):
┌─────────────────────────────────────────────────────────────────┐
│ GROUNDED INGESTION TIERS │
├─────────────────────────────────────────────────────────────────┤
│ TIER 1: Auto-approve (high confidence) │
│ • Content with explicit ADR/commit/doc links │
│ • User-stated facts about their own project │
│ • Decision discussions with clear context │
├─────────────────────────────────────────────────────────────────┤
│ TIER 2: Flag for review (medium confidence) │
│ • AI-synthesized claims without citations │
│ • Factual assertions about external systems/APIs │
│ → Queue for user confirmation before promotion │
├─────────────────────────────────────────────────────────────────┤
│ TIER 3: Block (low confidence) │
│ • Speculative content ("maybe", "might", "could be") │
│ • Content that contradicts existing memory │
│ → Reject with explanation, surface conflicts │
└─────────────────────────────────────────────────────────────────┘
Evidence Object Schema:
from datetime import datetime
from typing import Literal, Optional

class EvidenceObject:
    claim: str                            # The memory content
    source_id: Optional[str]              # ADR-XXX, commit hash, URL
    capture_time: datetime                # When captured
    validity_horizon: Optional[datetime]  # Expiration if time-bound
    confidence: Literal["high", "medium", "low"]
Validation Checks (lightweight, no external model calls):
- Citation presence: regex for [ADR-XXX], commit hashes, URLs
- Hedge word detection: reject speculative language
- Deduplication: cosine similarity >0.92 rejects as redundant
- Contradiction detection: deferred to Phase 3 (requires reasoning)
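The citation and hedge checks above map onto the three tiers roughly as sketched here. This is a simplified illustration, not the validator in src/memory/ingestion/: the regexes, the `Tier` enum, and `classify` are assumptions, the deduplication check is omitted because it needs embeddings, and uncited content always falls to Tier 2 because detecting "user-stated facts" takes more context than a regex has.

```python
import re
from enum import Enum


class Tier(Enum):
    AUTO_APPROVE = 1
    FLAG_FOR_REVIEW = 2
    BLOCK = 3


# Illustrative patterns only; a production detector would be stricter.
ADR_RE = re.compile(r"\bADR-\d+\b")
# 7-40 hex characters containing at least one digit, to skip plain words.
COMMIT_RE = re.compile(r"\b(?=[0-9a-f]*\d)[0-9a-f]{7,40}\b")
URL_RE = re.compile(r"https?://\S+")
HEDGE_RE = re.compile(r"\b(?:maybe|might|could be)\b", re.IGNORECASE)


def has_citation(claim: str) -> bool:
    return any(p.search(claim) for p in (ADR_RE, COMMIT_RE, URL_RE))


def classify(claim: str) -> Tier:
    """Assign an ingestion tier using only cheap, local checks."""
    if HEDGE_RE.search(claim):
        return Tier.BLOCK            # speculative language is rejected outright
    if has_citation(claim):
        return Tier.AUTO_APPROVE     # explicit ADR, commit, or URL reference
    return Tier.FLAG_FOR_REVIEW      # unsourced claim waits for user confirmation
```

Hedge detection runs first on purpose: a speculative sentence that happens to mention an ADR should still be blocked.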
Exit Criteria:
- 30% token efficiency improvement
- Provenance available for all retrieved items
- Stale memory detection operational
- Zero hallucination write-back in grounded ingestion tests
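For the Memory Blocks layout in the deliverables, a minimal assembly step might look like the following. The block order follows the five blocks listed above, and the Implementation Tracker mentions XML delimiters and line-preserving truncation; everything else (tag names, character budgets as a stand-in for token budgets, the helper names) is an assumption made for illustration.

```python
from xml.sax.saxutils import escape

BLOCK_ORDER = ["system", "project", "task", "history", "knowledge"]
DEFAULT_BUDGET = 2000  # characters; a crude proxy for a token budget


def truncate_lines(text: str, max_chars: int) -> str:
    """Keep whole leading lines until the budget is used up; never cut mid-line."""
    kept, used = [], 0
    for line in text.splitlines():
        cost = len(line) + 1  # +1 for the newline
        if used + cost > max_chars:
            break
        kept.append(line)
        used += cost
    return "\n".join(kept)


def assemble_context(blocks: dict[str, str], budgets: dict[str, int]) -> str:
    """Render non-empty blocks in a fixed order, each wrapped in an XML-style tag."""
    parts = []
    for name in BLOCK_ORDER:
        text = blocks.get(name, "")
        if not text:
            continue
        text = truncate_lines(text, budgets.get(name, DEFAULT_BUDGET))
        # Escape content so retrieved text cannot close or open a block tag.
        parts.append(f"<{name}>\n{escape(text)}\n</{name}>")
    return "\n".join(parts)
```

Escaping the block content also supports the retrieval-poisoning mitigation listed under Technical Risks, since retrieved text is treated as data and cannot forge a delimiter.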
Phase 3: Knowledge Structure (HybridRAG)
Options: A (HybridRAG with Knowledge Graph)
Purpose: Multi-hop reasoning for complex code queries
Prerequisites: Stable data from Phase 1-2, validated need for graph queries
Deliverables:
- Knowledge graph for code dependencies
- Entity extraction pipeline (async, not blocking)
- Reciprocal Rank Fusion (vector + graph)
- Neural reranker for final results
Hybrid Search Architecture (Council Addition - January 2026)
Intent: Combine vector similarity with graph relationships and keyword matching. Specific parameters to be tuned based on Phase 2 learnings.
┌─────────────────────────────────────────────────────────────────────┐
│ TWO-STAGE RETRIEVAL ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────┤
│ STAGE 1: Candidate Generation (parallel) │
│ • Vector Similarity (Dense) │
│ • Keyword BM25 (Sparse) │
│ • Knowledge Graph Traversal (Relationship-based) │
├─────────────────────────────────────────────────────────────────────┤
│ STAGE 2: Fusion + Reranking │
│ • Reciprocal Rank Fusion (RRF): Merge ranked lists │
│ • Cross-Encoder Reranker: Score top candidates on relevance │
│ • Return top 5 to context window │
└─────────────────────────────────────────────────────────────────────┘
Architectural Decisions (to be validated in Phase 2):
| Component | Intent | Parameters TBD |
|---|---|---|
| RRF Formula | Σ 1/(k + rank_i) | k value (standard: 60) |
| Multi-Query Expansion | Generate semantic variants | Max 3 variants (cost cap) |
| Candidate Pool | Broad retrieval before filtering | Top 50-100 candidates |
| Reranker | Cross-encoder for final ordering | Model choice TBD |
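The RRF row in the table above can be made concrete with a short sketch. It implements the standard formula, score(d) = Σ w_s / (k + rank_s(d)), with k = 60 and optional per-source weights. The function name and argument shapes are assumptions for illustration and are not taken from fusion.py.

```python
from collections import defaultdict
from typing import Optional


def reciprocal_rank_fusion(
    rankings: dict[str, list[str]],
    k: int = 60,
    weights: Optional[dict[str, float]] = None,
) -> list[tuple[str, float]]:
    """Merge ranked ID lists from several retrievers into one scored list."""
    weights = weights or {}
    scores: dict[str, float] = defaultdict(float)
    for source, ranked_ids in rankings.items():
        weight = weights.get(source, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):  # ranks are 1-based
            scores[doc_id] += weight / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


fused = reciprocal_rank_fusion({
    "vector": ["a", "b", "c"],
    "bm25": ["b", "a", "d"],
    "graph": ["b", "e"],
})
top_ids = [doc_id for doc_id, _ in fused]
```

RRF uses only ranks, never raw scores, which sidesteps the "different score scales" problem raised in the Build vs. Integrate discussion.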
Phase 2 Learnings Required Before Implementation:
- What percentage of queries require graph traversal?
- What's keyword search hit rate vs vector search?
- Where are retrieval failures occurring? (measure first, optimize second)
Target Queries (justify the investment):
"What services depend on auth that had incidents last month?"
"Show me all code that changed because of ADR-005"
"Which team owns the service that calls the failing endpoint?"
Exit Criteria:
- Multi-hop queries outperform pure vector by >50%
- Latency <1s for graph-augmented queries
- Entity extraction runs async (no chat blocking)
Phase 4: Advanced Capabilities
Options: C (Hindsight), E (MaaS), J (RAT)
Purpose: Future-state capabilities based on validated needs
Conditional Entry:
- Hindsight (C): Only if Phase 3 reveals need for temporal/causal reasoning
- MaaS (E): Only if multi-agent architecture is adopted
- RAT (J): Only if complex reasoning queries dominate usage
Approach: Cherry-pick techniques from Hindsight rather than wholesale adoption
Phase 4.2: MaaS Trust Model (Council-Documented Decision)
Context: The LLM Council raised design-level concerns during security verification:
- Callers can specify arbitrary capabilities
- Agents can join pools by knowing the pool ID
- Arbitrary owner_id assignment without verification
Decision: These are intentional architectural decisions, not security gaps.
Trust Boundary: MaaS APIs are internal interfaces, not public-facing endpoints.
┌─────────────────────────────────────────────────────────────────────────────┐
│ TRUST BOUNDARY MODEL │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ EXTERNAL (Untrusted) │ INTERNAL (Trusted) │
│ ───────────────────── │ ──────────────────── │
│ • End users │ • MCP Server layer │
│ • External APIs │ • CLI orchestrator │
│ • Network requests │ • Extension Registry │
│ │ • MaaS Registries │
│ │ │ │ │
│ ▼ │ ▼ │
│ ┌───────────────┐ │ ┌───────────────┐ │
│ │ Auth Layer │─────────────┼─▶│ Orchestrator │ │
│ │ (MCP Server) │ │ │ (Trusted) │ │
│ └───────────────┘ │ └───────────────┘ │
│ │ │ │
│ Authentication happens HERE │ ▼ │
│ │ ┌───────────────┐ │
│ │ │ MaaS APIs │ │
│ │ │ (No auth) │ │
│ │ └───────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
What MaaS APIs Assume (Trusted Orchestrator Pattern):
| Assumption | Rationale |
|---|---|
| owner_id is verified | Orchestrator authenticated the user before calling MaaS |
| Capabilities are appropriate | Orchestrator determines agent permissions based on auth context |
| Pool IDs are authorized | Orchestrator controls which pool IDs are exposed to which agents |
| Agent IDs are valid | Orchestrator manages agent lifecycle and provides valid IDs |
What MaaS APIs Enforce (Defense in Depth):
| Enforcement | Implementation |
|---|---|
| Capability checks | AgentIdentity.has_capability() - agents can only perform allowed actions |
| Scope hierarchy | SharedScope enum - agents cannot read above their max scope |
| Capacity limits | RegistryCapacityError, PoolCapacityError, HandoffCapacityError |
| Audit logging | All operations logged via MaaSAuditLogger for forensics |
| ID entropy | 128-bit UUIDs prevent ID guessing attacks |
| Defensive copies | All inputs/outputs copied to prevent state mutation |
| Handoff lifecycle | State machine prevents invalid transitions |
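Two of the enforcement rows above, capability checks and the scope hierarchy, can be illustrated with a small sketch. `AgentIdentity.has_capability()` and `SharedScope` are named in this ADR, but their definitions are not shown, so the fields, the enum members and their ordering, the `memory:read` capability string, and `read_shared` are all assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum


class SharedScope(IntEnum):
    # Member names and ordering are assumptions; the ADR only names the enum.
    USER = 1
    PROJECT = 2
    GLOBAL = 3


@dataclass(frozen=True)
class AgentIdentity:
    agent_id: str
    owner_id: str  # trusted as provided by the orchestrator, not re-verified here
    capabilities: frozenset[str]
    max_scope: SharedScope

    def has_capability(self, capability: str) -> bool:
        return capability in self.capabilities


class CapabilityError(PermissionError):
    """Raised when an agent attempts an action outside its grants."""


def read_shared(
    agent: AgentIdentity,
    scope: SharedScope,
    store: dict[SharedScope, list[str]],
) -> list[str]:
    if not agent.has_capability("memory:read"):
        raise CapabilityError(f"{agent.agent_id} lacks memory:read")
    if scope > agent.max_scope:
        raise CapabilityError(f"{agent.agent_id} cannot read above {agent.max_scope.name}")
    # Defensive copy: callers cannot mutate the shared store through the result.
    return list(store.get(scope, []))
```

Note what the sketch leaves out: nothing checks that `owner_id` is genuine or that the capability set is appropriate. Under the trusted orchestrator pattern, those decisions are made before the call reaches this layer.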
What MaaS APIs Do NOT Enforce:
| Not Enforced | Why |
|---|---|
| User authentication | Handled by MCP server / CLI layer |
| owner_id verification | Orchestrator responsibility (has auth context) |
| Pool membership authorization | Orchestrator controls pool ID exposure |
| Capability assignment policy | Business logic varies by deployment |
Rationale: This follows the same pattern as other internal registries:
- ExtensionRegistry (src/extensions/registry.py) - no auth checks
- LocalMemoryProvider (src/memory/providers/local.py) - trusts user_id
- MemoryStore - trusts caller-provided identifiers
Security Model Summary:
Authentication: MCP Server / CLI Layer (EXTERNAL)
Authorization: Orchestrator Layer (decides what to call)
Capability Checks: MaaS API Layer (enforces what agents CAN do)
Audit Trail: MaaS Audit Logger (records what DID happen)
Council Verification: This trust model was documented following LLM Council feedback during v0.3.0 security hardening. The council's concerns were valid observations about the API surface, but represent intentional design decisions for internal interfaces rather than security vulnerabilities.
Roadmap Visualization
┌─────────────────────────────────────────────────────────────────────────────┐
│ ROADMAP TIMELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Phase 0 Phase 1 Phase 2 Phase 3 Phase 4 │
│ Foundations Mem0 Context Eng HybridRAG Advanced │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ Eval │ │ Memory │ │ Blocks │ │ Graph │ │Hindsight│ │
│ │ Schema │─────▶│ Policy │─────▶│ Rerank │─────▶│ Vector │───▶│ MaaS │ │
│ │ Govern │ │ Cache │ │ Decay │ │ Fusion │ │ RAT │ │
│ └────────┘ └────────┘ └────────┘ └────────┘ └────────┘ │
│ │
│ [Must Have] [Quick Win] [Optimization] [Conditional] [Future] │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Tier Constraints (ADR-004): Memory persistence depth and retention policies vary by monetization tier. Free tier has local-only storage; Team tier includes cloud persistence with GDPR auto-deletion; Enterprise tier supports configurable retention with legal hold. See ADR-004 for tier definitions and ADR-007 for implementation details.
Implementation Tracker
| Phase | Component | Status | Notes |
|---|---|---|---|
| Phase 0 | Evaluation Harness | ✅ Complete | src/memory/evaluation/ - 28 tests |
| Phase 0 | Golden Dataset | ✅ Complete | tests/memory/golden_dataset.json - 50 questions |
| Phase 0 | Memory Schema & Lifecycle | ✅ Complete | src/memory/schemas/, src/memory/lifecycle/ - 36 tests |
| Phase 0 | Governance & Observability | ✅ Complete | src/memory/observability/ - 27 tests |
| Phase 0 | MemoryProvider Protocol | ✅ Complete | src/memory/protocols.py - ADR-007 compliant |
| Phase 0 | HNSW Recall Health Monitoring | ✅ Complete | src/memory/evaluation/recall_health.py, brute_force.py, baseline.py - 37 tests |
| Phase 0 | Embedding Version Tracking | ✅ Complete | src/memory/evaluation/embedding_version.py - Council verified |
| Phase 0 | Reindex Trigger | ✅ Complete | src/memory/maintenance/reindex_trigger.py - Auto-reindex on threshold breach |
| Phase 1 | Session Memory MCP | ✅ Complete | session_memory_server.py - 45 tests |
| Phase 1 | Pixeltable MCP | ✅ Complete | pixeltable_mcp_server.py - 62 tests |
| Phase 1a | Hot Memory Storage | ✅ Complete | src/memory/storage/, src/memory/providers/local.py - 27 tests |
| Phase 1a | Memory MCP Tools | ✅ Complete | src/memory/mcp/tools.py - 21 tests |
| Phase 1a | Hot Memory Latency | ✅ Complete | <50ms p95 target met - 6 tests |
| Phase 1b | Async Extraction | ✅ Complete | src/memory/extraction/ - 36 tests |
| Phase 1b | Extraction Precision | ✅ Complete | >85% target met (100% achieved) - 5 tests |
| Phase 1c | Retrieval & Ranking | ✅ Complete | src/memory/retrieval/ - 31 tests |
| Phase 1c | Query Rewriting | ✅ Complete | Synonym expansion for better recall |
| Phase 1c | Scope-Aware Retrieval | ✅ Complete | user > project > global hierarchy |
| Phase 1c | Retrieval Latency | ✅ Complete | <200ms p95 target met - 6 tests |
| Phase 1d | Janitor Process | ✅ Complete | src/memory/janitor/ - 24 tests |
| Phase 1d | Deduplication | ✅ Complete | >85% similarity threshold |
| Phase 1d | Contradiction Handling | ✅ Complete | "newer wins" with flagging |
| Phase 1d | Expiration Cleanup | ✅ Complete | TTL-based expiration |
| Phase 1d | Janitor Performance | ✅ Complete | <10 min for 10k target met - 5 tests |
| CI/Security | Memory Isolation | ✅ Complete | Zero cross-user leakage - 8 tests |
| CI/Security | Protocol Compliance | ✅ Complete | Three-layer testing - 23 tests |
| CI/Security | CI Workflow | ✅ Complete | .github/workflows/memory-evaluation.yml |
| Phase 2 | Memory Blocks Architecture | ✅ Complete | 5-block layout with XML delimiters |
| Phase 2 | Provenance Tracking | ✅ Complete | Full provenance on all retrievals |
| Phase 2 | Temporal Decay | ✅ Complete | Integrated in retrieval ranking |
| Phase 2 | History Compression | ✅ Complete | Line-preserving truncation |
| Phase 2 | Token Efficiency | ✅ Complete | 40% efficiency (>30% target) |
| Phase 2 | Exit Criteria Tests | ✅ Complete | 6 benchmark tests passing |
| Phase 2 | Grounded Memory Ingestion | ✅ Complete | 3-tier provenance model: src/memory/ingestion/ - 62 tests (security hardened) |
| Phase 3 | Two-Stage Retrieval | ✅ Complete | src/memory/retrieval/ - BM25 + Vector + RRF + Cross-Encoder |
| Phase 3 | Provider Integration | ✅ Complete | LocalMemoryProvider with optional hybrid retrieval |
| Phase 3 | Exit Criteria | ✅ Complete | <1s latency, multi-hop improvement - 15 benchmarks |
| Phase 3 | Entity Extraction | ✅ Complete | src/memory/extraction/entities/ - 81 tests, async pipeline, user_id auth |
| Phase 4 | Knowledge Graph | ✅ Complete | src/memory/graph/ - NetworkX backend, 96 tests |
| Phase 4 | Graph Types & Store | ✅ Complete | RelationshipType enum, GraphNode, GraphEdge, KnowledgeGraph |
| Phase 4 | Graph Builder | ✅ Complete | Builds graph from Memory entities with relationship inference |
| Phase 4 | Graph Search | ✅ Complete | Stage 1 candidate generation with traversal |
| Phase 4 | HybridRetriever Integration | ✅ Complete | 3-source RRF fusion (BM25 + Vector + Graph) |
| Phase 4 | LocalMemoryProvider Graph | ✅ Complete | Optional graph index management |
| Phase 4 | Hindsight Integration | ✅ Complete | src/memory/hindsight/ - 75 tests, four-network architecture |
| Phase 4 | Hindsight Types | ✅ Complete | NetworkType, TimeRange, TemporalEvent, TemporalMemory, StateChange |
| Phase 4 | Hindsight Timeline | ✅ Complete | Event storage, state reconstruction, time queries |
| Phase 4 | Hindsight Temporal Search | ✅ Complete | NL temporal query parsing, relevance scoring |
| Phase 4.2 | MaaS Architecture | ✅ Complete | src/memory/maas/ - 149 tests, Trust Model documented |
| Phase 4.2 | Agent Registry | ✅ Complete | registry.py - agent identity, capabilities, sessions |
| Phase 4.2 | Pool Registry | ✅ Complete | pool.py - shared memory pools, membership, permissions |
| Phase 4.2 | Handoff Manager | ✅ Complete | handoff.py - task handoffs, lifecycle management |
| Phase 4.2 | MCP Tools | ✅ Complete | mcp_tools.py - 15 async tools for agent/pool/handoff |
| Phase 4.2 | Security Hardening | ✅ Complete | 128-bit IDs, capacity limits, audit logging, defensive copies |
| Phase 4.2 | Trust Model | ✅ Complete | Council-documented architectural decisions (see Phase 4.2 section) |
| Option G | Context Caching | ✅ Complete | src/memory/retrieval/cache.py - 37 tests, LRU+TTL |
| Option G | RetrievalCache | ✅ Complete | LRU eviction, configurable TTL, thread-safe, metrics |
| Option G | Provider Integration | ✅ Complete | LocalMemoryProvider use_cache, auto-invalidation |
| Phase D | Production Optimization | ✅ Complete | src/memory/observability/ - monitoring and ops |
| Phase D | HNSW Scale Monitoring | ✅ Complete | scale_milestones.py - 10k/50k/100k thresholds - 36 tests |
| Phase D | RRF Weight Tuning | ✅ Complete | fusion.py weights parameter - 6 tests |
| Phase D | Graph Query Monitoring | ✅ Complete | graph_metrics.py - hop latency, size tracking - 27 tests |
| Phase D | Operations Runbook | ✅ Complete | docs/operations/memory-runbook.md |
Test Summary: 1282 memory tests passing (as of 2026-01-16) - 149 new MaaS tests
Legend: ✅ Complete | 🔄 Partial | 📝 Not Started | ❌ Blocked
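The Option G rows in the tracker describe a retrieval cache with LRU eviction, a configurable TTL, thread safety, and metrics. The sketch below shows one common way to combine those properties using `OrderedDict` and a lock. It is a generic illustration under the name `LruTtlCache`, not the `RetrievalCache` in src/memory/retrieval/cache.py; the injectable `clock` parameter is an assumption added to make expiry testable.

```python
import threading
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class LruTtlCache:
    def __init__(
        self,
        max_size: int = 128,
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._entries: "OrderedDict[Hashable, tuple[float, Any]]" = OrderedDict()
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable) -> Optional[Any]:
        with self._lock:
            entry = self._entries.get(key)
            if entry is None or self._clock() >= entry[0]:
                self._entries.pop(key, None)  # drop expired entries lazily
                self.misses += 1
                return None
            self._entries.move_to_end(key)    # mark as most recently used
            self.hits += 1
            return entry[1]

    def put(self, key: Hashable, value: Any) -> None:
        with self._lock:
            self._entries[key] = (self._clock() + self._ttl, value)
            self._entries.move_to_end(key)
            while len(self._entries) > self._max_size:
                self._entries.popitem(last=False)  # evict least recently used

    def invalidate(self) -> None:
        """Clear everything, e.g. after a store or delete changes the memory set."""
        with self._lock:
            self._entries.clear()
```

Calling `invalidate()` on every write is the simplest form of the auto-invalidation noted in the tracker; finer-grained invalidation per user or scope would reduce cache churn at the cost of more bookkeeping.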
Consequences
Positive
- AI assistants maintain context across sessions
- Reduced repetitive context loading (estimated 60-80% reduction)
- Institutional knowledge becomes searchable
- Consistent AI suggestions aligned with past decisions
- Multi-project organizational awareness
- Foundation for advanced multi-agent workflows
Negative
- Requires initial setup and ingestion
- Python version binding (mitigated by ADR-001)
- Storage growth with codebase size
- Embedding computation cost (mitigated by local models)
- New infrastructure to maintain (Pixeltable, MCP servers)
- Team learning curve on memory-augmented AI patterns
Neutral
- Requires MCP-compatible client
- Team adoption needed for full benefit
- ADR/incident discipline improves value
- Commits us to MCP protocol (reasonable bet given industry adoption)
- May require revisiting if context windows grow significantly larger
Risks and Mitigations (Council Additions)
Technical Risks
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Memory Poisoning | Medium | High | Provenance tracking, confidence scoring, human approval gates for "decision-grade" memories |
| Staleness & Drift | High | High | Link memories to repo commits, TTL policies, periodic revalidation jobs, git hook triggers |
| Contradictions | Medium | Medium | Conflict resolution policy (prefer newer? prefer docs?), show both with citations |
| Latency Blowups | Medium | High | Budgets per tier, caching, "fast path" vs "deep reasoning path" separation |
| Relevance Pollution | High | Medium | Temporal decay (Option K), ranking algorithms, "forgetting curve" implementation |
| Schema Drift | High | Medium | Async re-indexing on commit, version migrations, schema evolution plan |
| HNSW Silent Recall Degradation (Jan 2026) | High | High | Recall@k monitoring against exact search baseline; automated reindex triggers; retuning milestones at 10k/50k/100k items |
| Filter-Induced Recall Collapse (Jan 2026) | Medium | High | Evaluate recall with metadata filters applied (not just unfiltered); test heavy filtering scenarios (tenant, date, doc_type) |
| Embedding Model Drift (Jan 2026) | Medium | High | Version-tag all embeddings; on model change, flag for re-embedding; strict index-to-model versioning |
| Retrieval Poisoning (Jan 2026) | Low | High | Treat retrieved context as untrusted input; sanitization layer before injecting into system prompt; filter malicious instruction patterns |
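The HNSW mitigation above, Recall@k measured against an exact-search baseline, reduces to a small computation. The sketch below uses pure Python for clarity; the function names, the 0.05 tolerance, and the dictionary-of-vectors layout are assumptions and do not reflect recall_health.py or brute_force.py.

```python
import heapq
from math import sqrt


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[str]:
    """Brute-force ground truth: score every vector, keep the k most similar IDs."""
    return heapq.nlargest(k, vectors, key=lambda vid: cosine(query, vectors[vid]))


def recall_at_k(approx_ids: list[str], exact_ids: list[str]) -> float:
    """Fraction of the exact top-k that the approximate index also returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


def needs_reindex(recalls: list[float], baseline: float, tolerance: float = 0.05) -> bool:
    """Flag a reindex when mean recall drops more than `tolerance` below baseline."""
    return sum(recalls) / len(recalls) < baseline - tolerance
```

To cover the filter-induced recall collapse risk as well, the same measurement would be repeated with metadata filters applied to both the approximate and the exact search.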
Organizational Risks
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Scope Creep via Options | High | High | Hard constraint: Pick ≤3 options for 2025, phase gates required |
| Benchmark Gaming | Low | Medium | Include qualitative user feedback, test against real project history |
| Vendor Lock-in | Medium | Medium | Define internal memory APIs and adapters, avoid proprietary-only features |
| Operational Complexity | Medium | Medium | Document backup/migration procedures, plan embedding model upgrades |
Security & Privacy Risks
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Memory Leakage | Medium | High | Strict namespace isolation, MCP scoping, audit logs for cross-project retrievals |
| MEXTRA Attacks | Low | High | Output filtering, memory de-identification, input validation |
| Secret Exposure | Low | Critical | Secret redaction in ingestion, .gitignore-style exclusion patterns |
Risk Assessment by Council Model
Gemini: "Memory update triggers (Git hooks) are required to keep memory synchronized. Without them, the Knowledge Graph becomes a hallucination source."
Claude: "Wrong context is worse than no context. Memory poisoning needs confidence scoring and human approval gates."
Grok: "Ensure memory systems don't perpetuate biases in persistent context via biased knowledge graphs."
GPT: "Memory systems need pruning strategies - relevance pollution is a real problem at scale."
Security Considerations
Per recent research (MEXTRA attack, February 2025), memory systems are vulnerable to:
- Prompt injection extracting stored memories
- Cross-session data leakage
Mitigations (to implement):
- User/session-level isolation
- Memory de-identification
- Output filtering
- Audit logging
Related Decisions
- ADR-001: Python Version Requirement (database integrity)
- ADR-002: Workflow Integration (automated ingestion)
References
- Mem0 Research: 26% Accuracy Boost
- Hindsight Agentic Memory - 91.4% LongMemEval
- HybridRAG: Integrating Knowledge Graphs
- MCP One Year Anniversary - November 2025 Spec
- Context Engineering for Agents
- HNSW at Scale: Why Your RAG System Gets Worse (Jan 2026)
- 12 Advanced Types of RAG (Jan 2026)
- LangMem SDK Launch
- Beyond RAG: Context Engineering
- ACE: Agentic Context Engineering
- Memory as a Service (MaaS)
Open Questions
The following questions have been investigated; three are resolved and two are partially resolved:
-
Single-tenant vs Multi-tenant: ✅ Resolved - Multi-tenancy is supported via the TenantProvider protocol (ADR-005) and CloudTenantProvider implementation (ADR-007). See ADR-007 Section 1a for tenant isolation enforcement points.
-
Memory Approval Workflows: ✅ Resolved - The AuditLogger protocol (ADR-007) provides provenance tracking. GDPR compliance workflows in GDPRService (luminescent-cloud) implement approval gates for data deletion.
-
Retention Policies: ✅ Resolved - ADR-007 Section 4 defines tier-specific retention:
- Free (OSS): User's responsibility (no auto-deletion)
- Team (Cloud): Auto-delete on workspace exit (GDPR Article 17)
- Enterprise: Configurable (legal hold support)
-
Embedding Model Upgrades: 🔄 Partially Resolved - Pixeltable computed columns support re-indexing. Full migration strategy deferred to Phase 3 (HybridRAG).
-
Conflict Resolution: 🔄 Partially Resolved - Phase 1d Janitor Process implements "newer wins" with flagging for review. Complex contradiction handling deferred to Phase 2 (Context Engineering).
Council Review Summary
Design Review (2025-12-22)
Council Configuration: High confidence (all 4 models)
Models: Gemini-3-Pro, Claude Opus 4.5, Grok-4, GPT-5.2-Pro
Unanimous Recommendations (All 4 Models)
- Add Phase 0 for foundations, evaluation, and governance
- Reorder roadmap: Mem0 before HybridRAG
- Add quantified success metrics
- Add provenance/governance as core requirement
- Include context caching for cost optimization
Key Insights by Model
- Gemini: "Optimization before Foundation" is the core flaw
- Claude: "Wrong context is worse than no context"
- Grok: Emphasized bias prevention in knowledge graphs
- GPT: Highlighted need for "Phase 0" evaluation harness
Implementation Verification (2026-01-07)
Council Configuration: High confidence (all 4 models)
Models: GPT-5.2-Pro, Gemini-3-Pro, Claude Opus 4.5, Grok-4
Transcripts: .council/logs/2026-01-07T21-25-55-87316c79/, 2026-01-07T21-41-43-284ed67c/, 2026-01-07T21-49-29-4db02c46/
Round 1 (Commit 0c8c41a) - REJECTED
| Issue | Severity | Fix Applied |
|---|---|---|
| EvaluationHarness defaulted success=True when no evaluate_fn | Critical | Changed to success=False |
| Precision/recall/F1 aliased to accuracy (not calculated) | Critical | Now uses metrics.py functions |
| O(N²) janitor complexity | High | Documented as known trade-off (meets <10min target) |
| Silent exception swallowing | High | Added error collection and reporting |
Round 2 (Commit e448f22) - REJECTED
| Issue | Severity | Fix Applied |
|---|---|---|
| Janitor fallback modified metadata but didn't persist | Critical | Removed broken fallback, use invalidate/delete only |
| FP/FN indistinguishable (both set to failed) | Critical | Now tracks by retrieved_memories presence |
| invalidated count incremented without action | Critical | Only increment on successful operation |
Round 3 (Commit 720a61b) - UNCLEAR (Confidence 0.54)
| Remaining Concern | Status | Rationale |
|---|---|---|
| O(N²) complexity | Acceptable | Meets ADR-003 target (<10min for 10k memories, actual: <3s) |
| resolve_duplicates unused | By Design | Used internally by find_duplicates workflow |
| Keyword-based contradiction detection | Phase 1 Scope | Semantic analysis deferred to Phase 2 |
| Hard-coded limit=10000 | Acceptable | Documented limitation for Phase 1 |
Round 4 (Commit 0efa8a2) - UNCLEAR (Confidence 0.61, No Blocking Issues)
Focus: ADR-005 Dual-Repo Compliance Fix
| Issue | Severity | Fix Applied |
|---|---|---|
| MemoryProvider not exported from extensions module | Critical | Added to __init__.py imports and __all__ |
| ResponseFilter not exported | Critical | Added to __init__.py imports and __all__ |
| MEMORY_PROVIDER_VERSION not exported | Critical | Added to __init__.py imports and __all__ |
| Misleading docstring import path | Medium | Fixed to use src.extensions |
| Brittle _is_runtime_protocol check | Medium | Changed to isinstance() test |
Outcome: No blocking issues. Rationale "approved" at 0.9 confidence.
Transcript: .council/logs/2026-01-07T22-43-42-b3f93afd/
Final Scores (Round 4)
- Accuracy: 10.0/10
- Completeness: 6.3/10
- Clarity: 7.0/10
- Blocking Issues: None
Split Decision (Rounds 1-3)
- GPT-5.2-Pro: REJECTED (strictest interpretation)
- Gemini-3-Pro: REJECTED (completeness concerns)
- Claude Opus 4.5: NEEDS REVIEW (acceptable for Phase 1)
- Grok-4: APPROVED (most lenient)
Related Decisions
- ADR-001: Python Version Requirement (database integrity)
- ADR-002: Workflow Integration (automated ingestion)
- ADR-004: Monetization Strategy (tier-based memory constraints)
- ADR-005: Repository Organization Strategy (OSS vs Paid separation, Extension Registry)
- ADR-006: Chatbot Platform Integrations (adapters implementing memory I/O)
- ADR-007: Cross-ADR Integration Guide (Phase Alignment Matrix, protocol consolidation)
Changelog
| Version | Date | Changes |
|---|---|---|
| 1.0 | 2025-12-22 | Initial draft |
| 2.0 | 2025-12-22 | Added industry research and 6 improvement options |
| 3.0 | 2025-12-22 | Council Review: Revised roadmap (Phase 0 + reordering), added decision drivers, non-goals, success metrics table, interface contract, 5 additional options (G-K), comprehensive risk matrix, council feedback summary |
| 4.0 | 2025-12-22 | Council Review #2: Rejected Mem0 integration in favor of Pixeltable-native memory. Added "Build vs Integrate" decision section with code examples. Updated Phase 1 to Pixeltable Native approach. Added reference to ADR-004 (Monetization). |
| 4.1 | 2025-12-23 | Council Validation: Extended Phase 1 to 8 weeks. Added "Janitor" (Hot/Warm/Cold) architecture for async extraction. Added Golden Dataset requirement. Added validation metrics and decision reversal triggers. |
| 4.2 | 2025-12-28 | Cross-ADR Synchronization: Added ADR-006 and ADR-007 to Related Decisions. Updated Non-Goals to reflect multi-tenancy via Extension Registry. Added Interface Contract admonition linking to consolidated protocols. Marked Open Questions as resolved with references. Added Implementation Tracker section. Added tier constraints note to Roadmap. Council review (Grok-4, GPT-4o). |
| 4.3 | 2026-01-07 | Implementation Verification: Council reviewed Phase 0-1d implementation across 3 rounds. Fixed critical bugs in EvaluationHarness (metrics calculation) and Janitor (persistence, error handling). Added dry-run mode and soft-delete to janitor. Updated test count to 330. Added Implementation Verification section with Council transcripts. Phase 2+ remains NOT STARTED. |
| 4.4 | 2026-01-07 | ADR-005 Compliance Fix: Exported MemoryProvider, ResponseFilter, MEMORY_PROVIDER_VERSION from src/extensions/__init__.py. Council Round 4: No blocking issues, 10.0/10 accuracy. Fixed #117, unblocked #114. Test count: 1008 total (336 memory-specific). |
| 4.5 | 2026-01-08 | Phase 2 Complete: Implemented Memory Blocks Architecture with 5-block layout (System, Project, Task, History, Knowledge). Added provenance tracking on all retrievals, line-preserving truncation, XML-safe delimiters. Met all exit criteria: 40% token efficiency (>30% target), provenance on all items, stale detection operational. Test count: 436 memory tests. Council verified across 8 rounds. |
| 4.6 | 2026-01-08 | Security Hardening (Council Rounds 13-19): Comprehensive DoS prevention in ProvenanceService. Added: bounded LRU storage, string identifier length limits, metadata bounds validation, recursive nested structure validation with early termination, strict JSON type safety, cycle detection, UTF-8 byte size validation, TOCTOU prevention via deep copy, score range validation (0.0-1.0). Test count: 490 memory tests (64 provenance-specific security tests). |
| 4.7 | 2026-01-09 | Research-Driven Strategy Update: Council-validated additions based on January 2026 RAG research. Phase 0: Added HNSW Recall Health Monitoring (critical silent failure mode), recall@k against exact search baseline, retuning milestones at 10k/50k/100k items, embedding versioning. Phase 2: Added Grounded Memory Ingestion with 3-tier provenance model, Evidence Object schema. Phase 3: Added Two-Stage Retrieval Architecture intent (RRF, multi-query expansion, cross-encoder reranking). Risks: Added HNSW silent recall degradation, filter-induced recall collapse, embedding model drift, retrieval poisoning. References: HNSW at Scale (TDS), 12 RAG Types (TuringPost). |
| 4.8 | 2026-01-09 | HNSW Recall Health Complete: Implemented full Phase 0 HNSW monitoring suite. Added recall_health.py (RecallHealthMonitor, Recall@k computation), brute_force.py (exact cosine similarity ground truth), baseline.py (drift detection with atomic writes, history pruning), embedding_version.py (model version tracking). Security hardening: symlink protection, path containment, PII exclusion, salted hashing. Added ReindexTrigger for auto-reindex. Council verified across 31 rounds (0 blocking issues, 10/10 accuracy). Test count: 527 memory tests. Remaining: Grounded Memory Ingestion (Phase 2), Two-Stage Retrieval (Phase 3). |
| 4.9 | 2026-01-09 | Grounded Memory Ingestion Complete: Implemented Phase 2 3-tier provenance model to prevent hallucination write-back. Added src/memory/ingestion/ package with 8 modules: evidence.py (EvidenceObject), result.py (ValidationResult, IngestionTier), citation_detector.py (ADR/commit/URL regex), hedge_detector.py (speculative language blocking), dedup_checker.py (Jaccard similarity >0.92), validator.py (IngestionValidator orchestrator), review_queue.py (Tier 2 pending memories). Integrated into create_memory() MCP tool with bypass_validation option. Added review queue MCP tools: get_pending_memories, approve_pending_memory, reject_pending_memory. Exit criteria met: zero hallucination write-back in grounded ingestion tests. Test count: 575 memory tests (48 ingestion-specific). Remaining: Two-Stage Retrieval (Phase 3). |
| 5.0 | 2026-01-09 | Grounded Memory Ingestion Security Hardened: Council verification across 6 rounds identified and fixed critical security vulnerabilities. Fixes: (1) Hedge bypass via assertion markers - assertions no longer override is_speculative; (2) Dedup fail-open - now raises DedupCheckError, flags for review; (3) Cross-tenant data leak in get_review_history - requires user_id; (4) Cross-tenant DoS via eviction - rejects at capacity; (5) IDOR in get_by_id - requires authorization; (6) Strong speculation missed - added "i don't know"; (7) Race condition in approve - removes before callback; (8) Unbounded history growth - added MAX_HISTORY_ENTRIES. Council synthesis: "approved" at 0.9 confidence. Test count: 589 memory tests (62 ingestion-specific). |
| 6.0 | 2026-01-09 | Two-Stage Retrieval Architecture Complete (Phase 3): Implemented hybrid retrieval as specified in ADR-003 Phase 3. Stage 1 (Parallel Candidate Generation): bm25.py (BM25 sparse keyword search with tokenization, IDF, multi-tenant indexes), vector_search.py (dense semantic search with sentence-transformers, lazy loading, cosine similarity). Stage 2 (Fusion + Reranking): fusion.py (RRF algorithm with k=60, weighted fusion, fuse_with_details for provenance), reranker.py (cross-encoder reranking with ms-marco-MiniLM-L-6-v2, fallback reranker for fast mode). Orchestrator: hybrid.py (HybridRetriever with parallel Stage 1 via asyncio.gather, RRF fusion, optional reranking, RetrievalMetrics for latency tracking). Integration: Updated src/memory/retrieval/__init__.py exports. Exit Criteria Met: latency <1s, multi-hop queries outperform pure vector. Test count: 757 memory tests (168 new retrieval tests, 15 exit benchmarks). Remaining: Knowledge Graph (Phase 3), Entity Extraction (Phase 3). |
| 6.1 | 2026-01-09 | Provider Integration Complete: Integrated HybridRetriever into LocalMemoryProvider with backwards compatibility. Features: Optional use_hybrid_retrieval parameter (default: False), use_cross_encoder for quality vs speed tradeoff, use_query_rewriter for term expansion. New Methods: retrieve_with_scores() returns (Memory, score) tuples, retrieve_with_metrics() returns detailed RetrievalMetrics. Automatic Index Management: store/delete/clear operations sync with hybrid indexes. Bug Fix: Fixed hybrid.py to use query_rewriter.rewrite() instead of expand() (string vs list return type). Test count: 773 memory tests (16 new provider integration tests). Remaining: Knowledge Graph (Phase 3), Entity Extraction (Phase 3). |
| 6.2 | 2026-01-16 | Entity Extraction Complete (Phase 3): Implemented async entity extraction pipeline for knowledge graph support. New Module: src/memory/extraction/entities/ with 6 files. Types: EntityType enum (SERVICE, DEPENDENCY, API, PATTERN, FRAMEWORK, CONFIG), Entity dataclass with confidence scoring, EntityExtractor protocol. Extractors: MockEntityExtractor (pattern-based for testing), HaikuEntityExtractor (LLM-based for production). Pipeline: EntityExtractionPipeline with process() (blocking) and process_async() (fire-and-forget). Security: user_id authorization check prevents cross-user memory modification. Storage: Entities stored in Memory.metadata["entities"] for knowledge graph construction. Provider Fix: LocalMemoryProvider now supports metadata merge updates. GitHub Issues: Closed #118, #119, #120, #121. Test count: 854 memory tests (81 entity extraction tests incl. 2 security). TDD approach + council security review. Remaining: Knowledge Graph (Phase 4). |
| 6.3 | 2026-01-16 | Knowledge Graph Complete (Phase 4): Implemented multi-hop reasoning for queries like "What services depend on PostgreSQL?". New Module: src/memory/graph/ with 5 files. Types: RelationshipType enum (7 types: DEPENDS_ON, USES, IMPLEMENTS, CALLS, CONFIGURES, HAD_INCIDENT, OWNED_BY), GraphNode dataclass, GraphEdge dataclass. Store: KnowledgeGraph using NetworkX DiGraph, supports serialization. Builder: GraphBuilder creates graph from Memory entities with relationship inference. Search: GraphSearch for Stage 1 candidate generation with traversal scoring (direct match 1.0, neighbor 0.7, predecessor 0.6). HybridRetriever: Updated for 3-source RRF fusion (BM25 + Vector + Graph), metrics track graph_candidates. LocalMemoryProvider: Optional use_graph=True parameter, automatic graph building from stored memories. Exit Criteria Met: <1s latency, multi-hop queries outperform pure vector. GitHub Issues: Closed #122, #123, #124, #125, #126. Test count: 955 memory tests (96 new Knowledge Graph tests). TDD approach with 11 integration tests. |
| 6.4 | 2026-01-16 | Hindsight Temporal Memory Complete (Phase 4): Implemented four-network temporal memory for time-based queries. New Module: src/memory/hindsight/ with 4 files. Types: NetworkType enum (WORLD, BANK, OPINION, OBSERVATION), TimeRange with relative time support, TemporalEvent with validity periods, TemporalMemory wrapper, StateChange for transitions. Timeline: Event storage with indexing by entity/network/time, state reconstruction at any point in time. Temporal Search: NL temporal query parsing ("last month", "Q4 2025", "before incident-123"), relevance scoring with temporal context. Target Queries: "What changed last month?", "What was auth-service status before incident-123?", "Show me decisions made in Q4 2025". Council Review: Four-network architecture validated, security recommendations noted, HybridRetriever integration recommended. GitHub Issues: Closed #123, #124, #125, #126. Test count: 1030 memory tests (75 new Hindsight tests). TDD approach. |
| 6.5 | 2026-01-16 | Context Caching Complete (Option G): Implemented provider-side caching for 75-90% cost reduction. RetrievalCache: src/memory/retrieval/cache.py with LRU eviction, configurable TTL (default 3600s), thread-safe operations (RLock), per-user invalidation, hit/miss metrics, serialization support. Provider Integration: LocalMemoryProvider use_cache parameter (default: False), automatic invalidation on store/delete/clear, get_cache_metrics() for monitoring. Council Review: Unanimous approval; invalidation strategy validated; auth drift concern noted (recommendation: invalidate on permission changes); council also recommended a 5-minute default TTL in place of the implemented 3600s default. GitHub Issues: Closed #127, #128. Test count: 1067 memory tests (37 new Caching tests). TDD approach. |
| 6.6 | 2026-01-16 | Production Optimization Complete (Phase D): Implemented operational monitoring and documentation for production readiness. HNSW Scale Monitoring: src/memory/observability/scale_milestones.py with ScaleMilestoneTracker (10k/50k/100k items), MilestoneCheckResult for health checks, configurable recall/latency thresholds, callback-based health check triggers. RRF Weight Tuning: Updated fusion.py with per-source weights parameter for tunable multi-source retrieval. Graph Query Monitoring: src/memory/observability/graph_metrics.py with GraphMetricsCollector (p50/p95/p99 latency), hop-level latency breakdown (direct/neighbor/predecessor), QueryMeasurementContext for easy instrumentation, GraphSizeSnapshot for node/edge tracking. Operations Runbook: docs/operations/memory-runbook.md with architecture overview, key metrics, troubleshooting guides, scaling guidelines, health check procedures. GitHub Issues: Closed #129, #130, #131. Test count: 1133 memory tests (69 new Phase D tests: 36 scale milestones, 6 RRF weights, 27 graph metrics). TDD approach. |
| 6.7 | 2026-01-16 | MaaS Trust Model Documented (Phase 4.2): Added Trust Model section documenting architectural decisions raised by LLM Council. Trust Boundary: MaaS APIs are internal interfaces called by trusted orchestrator layer. What's Enforced: Capability checks, scope hierarchy, capacity limits (DoS prevention), audit logging, 128-bit ID entropy, defensive copies, handoff lifecycle. What's NOT Enforced: Authentication (MCP server layer), owner_id verification (orchestrator responsibility), pool membership authorization (orchestrator controls exposure). Rationale: Follows same pattern as ExtensionRegistry, LocalMemoryProvider. Implementation Tracker: Updated with 7 MaaS completion entries. Test count: 1282 memory tests (149 MaaS tests). Council feedback incorporated. |
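
For readers unfamiliar with the fusion step referenced in versions 6.0, 6.3, and 6.6, the following is a minimal sketch of weighted Reciprocal Rank Fusion with k=60 over multiple candidate sources. It is not the contents of fusion.py: the function name `rrf_fuse`, its signature, and the source labels (`bm25`, `vector`, `graph`) are assumptions chosen for illustration. Each source contributes `weight / (k + rank)` for every candidate it returns, with ranks starting at 1.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple


def rrf_fuse(
    rankings: Dict[str, Sequence[str]],
    k: int = 60,
    weights: Optional[Dict[str, float]] = None,
) -> List[Tuple[str, float]]:
    """Fuse ranked ID lists with (optionally weighted) Reciprocal Rank Fusion.

    Hypothetical helper, not the project's fusion.py API.
    rankings: source name -> candidate IDs ordered best-first.
    weights:  source name -> multiplier; sources not listed default to 1.0.
    """
    weights = weights or {}
    scores: Dict[str, float] = defaultdict(float)
    for source, ranked_ids in rankings.items():
        weight = weights.get(source, 1.0)
        # Ranks are 1-based, so the top candidate contributes weight / (k + 1).
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties are broken by ID for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Illustrative candidate lists from three sources (made-up memory IDs).
candidates = {
    "bm25": ["m1", "m2", "m3"],
    "vector": ["m2", "m1"],
    "graph": ["m3"],
}
fused = rrf_fuse(candidates)
boosted = rrf_fuse(candidates, weights={"graph": 2.0})
```

Because RRF uses only ranks, the differing score scales of BM25, cosine similarity, and graph traversal scores never need to be normalized against each other. Raising a source's weight, as in `boosted`, shifts results that source ranks highly toward the top without changing the other sources' inputs.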