ADR-003: Project Intent - Persistent Technical Context for AI Development

Status: Accepted
Date: 2025-12-22
Decision Makers: Development Team
Owners: @christopherjoseph
Version: 4.2 (Cross-ADR Synchronized)

Decision Summary

We adopt a Memory-First architecture using MCP, implementing a phased approach that prioritizes foundations and quick wins before architectural complexity:

  1. Phase 0: Evaluation, governance, and observability foundations
  2. Phase 1: Pixeltable-native conversational memory for immediate personalization
  3. Phase 2: Context Engineering for optimized retrieval
  4. Phase 3: HybridRAG with Knowledge Graph for complex queries
  5. Phase 4: Advanced features (Hindsight, MaaS) as needed

We reject simple Vector-only RAG as insufficient for code logic and dependency reasoning.


Critical Architecture Decision: Build vs. Integrate (Mem0)

The Question

Should we integrate Mem0 for conversational memory, or build equivalent capabilities on Pixeltable?

| Capability | Pixeltable | Mem0 |
|---|---|---|
| Vector embeddings | ✓ | ✓ |
| Semantic search | ✓ | ✓ |
| Structured metadata | ✓ | ✓ |
| Graph relationships | Partial | ✓ (Mem0g) |
| User/session scoping | Manual | Built-in |
| Memory extraction | Manual | Automatic |

Council Decision: Build on Pixeltable (Option B)

All 4 models unanimously agreed: Pixeltable must remain the canonical source of truth.

The "Split Brain" Problem

Integrating Mem0 creates two vector databases:

User Query → MCP Server → ???
  ├→ Pixeltable (code, ADRs, incidents)
  └→ Mem0 (conversations, preferences)
       └→ Its own vector DB

Problems this creates:

  • Two sources of truth with different consistency models
  • Cross-system queries for "What did we decide about auth?" are complex
  • Different score scales make result merging difficult
  • Debugging spans two systems with different logging

Why Pixeltable Can Replace Mem0

Mem0's primary value (extraction + scoping) can be replicated using Pixeltable's computed columns:

# Memory extraction via LLM UDF (computed column)
@pt.computed_column
def extract_memory_facts(row) -> list[dict]:
    """Extract memorable facts from conversation."""
    prompt = """
    Analyze this conversation and extract:
    1. User preferences (explicit and implicit)
    2. Decisions made
    3. Facts learned about the codebase
    4. Corrections to previous understanding
    Return as structured JSON with confidence scores.
    """
    return llm_extract(prompt, row.content)

# User/session scoping
@pt.computed_column
def memory_scope(row) -> dict:
    return {
        "user_id": row.user_id,
        "project_id": row.project_id,
        "scope_hierarchy": [
            f"user:{row.user_id}",
            f"project:{row.project_id}",
            "global",
        ],
    }

# Memory consolidation (dedup, merge, supersede)
class MemoryConsolidator:
    async def consolidate(self, new_memory: Memory) -> ConsolidationResult:
        similar = await self.find_similar(new_memory, threshold=0.85)
        if not similar:
            return ConsolidationResult(action="insert", memory=new_memory)
        # Handle contradictions, merges, supersedes...

Decision Matrix

| Criteria | A) Integrate Mem0 | B) Build on Pixeltable |
|---|---|---|
| Time to MVP | 2-3 weeks | 4-6 weeks |
| Maintenance burden | High (two systems) | Low (one system) |
| Feature velocity | Fast initially, slows | Slower initially, compounds |
| Architectural simplicity | Poor | Excellent |
| Long-term flexibility | Limited by Mem0's roadmap | Full control |
| Debugging | High complexity | Low complexity |
| Data consistency | Two sources of truth | Single source |

Accepted Trade-offs

  • Longer initial development (4-6 weeks vs. 2-3 weeks)
  • Must build extraction/consolidation logic ourselves
  • No access to Mem0's ongoing R&D

Validation Criteria (Phase 1 Exit)

  • Memory retrieval latency < 200ms p95
  • Extraction accuracy > 80% (manual evaluation)
  • Zero cross-user memory leakage

Context

The Problem: Ephemeral AI Context

Large Language Models (LLMs) suffer from a fundamental limitation: context window amnesia. Each conversation starts fresh, forcing developers to repeatedly explain:

  • Project architecture and design decisions
  • Coding conventions and patterns
  • Past incidents and their resolutions
  • Domain-specific terminology
  • Team decisions and their rationale

This creates three significant pain points:

  1. Repetitive Context Loading: Developers waste time re-establishing context every session
  2. Lost Institutional Knowledge: Valuable decisions and learnings evaporate between conversations
  3. Inconsistent Assistance: Without historical context, AI suggestions may contradict past decisions

The Vision: Persistent Technical Memory

Luminescent Cluster aims to give AI assistants persistent technical memory - the ability to recall project context, architectural decisions, incident history, and codebase knowledge across sessions and even across different LLM providers.

Industry Context (December 2025)

The LLM memory landscape has evolved rapidly. Key developments:

  • MCP Standardization: The Model Context Protocol is now the de-facto standard, adopted by OpenAI, Google DeepMind, Microsoft, and AWS. In December 2025, Anthropic donated MCP to the Linux Foundation's Agentic AI Foundation (AAIF).
  • Beyond RAG: Traditional RAG is being challenged by agentic memory architectures that maintain context over time, track evolving beliefs, and perform temporal reasoning.
  • Production Scale: Systems like Mem0 have achieved 26% accuracy improvements with 91% lower latency and 90% token savings at enterprise scale (186M API calls/Q3 2025).
  • Context Engineering: A new discipline focused on "the delicate art and science of filling the context window with just the right information" (Andrej Karpathy).

Decision Drivers (Ranked)

| Priority | Driver | Weight | Rationale |
|---|---|---|---|
| 1 | Developer Experience | 30% | Reduce repetition friction; faster onboarding |
| 2 | Accuracy | 25% | Retrieved context must be relevant and correct |
| 3 | Token Efficiency | 20% | Context window is expensive; minimize waste |
| 4 | Implementation Velocity | 15% | Need wins this quarter to validate approach |
| 5 | Future Flexibility | 10% | Don't lock in prematurely; enable pivots |

Non-Goals

This ADR explicitly does not address:

  • Multi-tenant implementation details: Multi-tenancy is technically supported via the Extension Registry (ADR-005) and unified in the integration layer (ADR-007). This ADR remains focused strictly on memory architecture; tenant isolation is treated here as an external constraint rather than a core feature
  • General documentation RAG: Focus is on technical context, not help docs
  • AGI-style continuous learning: Memory is curated, not autonomous
  • PII/secrets storage: Sensitive data excluded by policy
  • Training custom models: Memory augments existing LLMs, doesn't train new ones

Success Metrics (Quantified)

| Metric | Baseline | Target | Measurement |
|---|---|---|---|
| Context re-explanation frequency | ~5/session | <1/session | User survey |
| Context retrieval latency | N/A | <500ms p95 | Instrumentation |
| Retrieval precision@5 | N/A | >85% | LongMemEval subset |
| Token efficiency | N/A | <30% of window for memory | Context analysis |
| Multi-hop query accuracy | N/A | >75% | Internal benchmark |
| Developer satisfaction (NPS) | N/A | >50 | Quarterly survey |

Decision

We implement a three-tier memory architecture exposed via the Model Context Protocol (MCP):

Tier 1: Session Memory (Hot Context)

Purpose: Fast access to current development state
Implementation: session_memory_server.py
Data Sources:

  • Git repository state (current branch, status)
  • Recent commits (last 200)
  • Current diff (staged/unstaged changes)
  • File change history
  • Active task context (set by user/agent)

Characteristics:

  • Latency: <10ms (in-memory)
  • Scope: Current repository
  • Persistence: Ephemeral (session-bound)
  • Cost: Zero (local computation)

Tier 2: Long-term Memory (Persistent Knowledge)

Purpose: Semantic search over organizational knowledge
Implementation: pixeltable_mcp_server.py + pixeltable_setup.py
Data Sources:

  • Code repositories (multi-service)
  • Architectural Decision Records (ADRs)
  • Production incident history
  • Meeting transcripts
  • Documentation

Characteristics:

  • Latency: 100-500ms (semantic search)
  • Scope: Entire organization, cross-project
  • Persistence: Durable (survives restarts)
  • Cost: Embedding generation (local sentence-transformers by default)

Tier 3: Intelligent Orchestration

Purpose: Efficient multi-tool coordination
Mechanisms:

  • Tool Search: On-demand tool discovery (85% token reduction)
  • Programmatic Tool Calling: Batch operations in sandbox (37% token reduction)
  • Deferred Loading: Heavy tools loaded only when needed

Architecture

┌──────────────────────────────────────────────────────────────────┐
│                    AI Assistant (Claude, etc.)                   │
├──────────────────────────────────────────────────────────────────┤
│                   Model Context Protocol (MCP)                   │
├─────────────────┬────────────────────────┬───────────────────────┤
│     Tier 1      │         Tier 2         │        Tier 3         │
│  Session Memory │    Long-term Memory    │     Orchestration     │
├─────────────────┼────────────────────────┼───────────────────────┤
│ • Git state     │ • Pixeltable DB        │ • Tool Search         │
│ • Recent commits│ • Semantic embeddings  │ • Programmatic Calls  │
│ • Current diff  │ • Multi-project index  │ • Deferred Loading    │
│ • Task context  │ • ADRs, incidents      │                       │
└─────────────────┴────────────────────────┴───────────────────────┘

Interface Contract (MCP)

Implementation Note: These interface definitions represent the conceptual architectural vision. For live, versioned protocol signatures (including ContextStore, TenantProvider, and AuditLogger), refer to src/extensions/protocols.py as consolidated in ADR-007. Chatbot-specific implementations of these interfaces are provided by the adapters defined in ADR-006.

The memory system exposes the following MCP resources and tools:

Resources (Read-Only Context)

project://memory/recent_decisions    # Last N architectural decisions
project://memory/active_incidents    # Open incidents affecting this service
project://memory/conventions         # Coding patterns and team standards
project://memory/dependency_graph    # Service relationships (Phase 3+)

Tools (Actions)

# Session Memory Tools
set_task_context(task: str, details: dict)   # Set current work context
get_task_context() -> TaskContext            # Retrieve current context
search_commits(query: str) -> list[Commit]   # Search commit history

# Long-term Memory Tools
semantic_search(query: str, limit: int) -> list[Result]
ingest_code(path: str, service: str)    # Index codebase
ingest_adr(path: str, service: str)     # Index decision record
ingest_incident(summary: str, severity: str, lessons: str)

# Memory Management (Phase 0+)
update_memory(key: str, value: Any, source: str)   # With provenance
invalidate_memory(key: str, reason: str)           # Explicit expiration
get_memory_provenance(key: str) -> Provenance      # Audit trail
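
For illustration, a minimal sketch of how a couple of these tools could be served via the official Python MCP SDK's FastMCP helper; the server name and the _search_knowledge_base placeholder are assumptions, not the project's actual implementation:

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("luminescent-memory")

_task_context: dict = {}  # Tier 1: ephemeral, session-bound


@mcp.tool()
def set_task_context(task: str, details: dict) -> str:
    """Set current work context (Tier 1 session memory)."""
    _task_context.clear()
    _task_context.update({"task": task, **details})
    return "task context updated"


@mcp.tool()
def semantic_search(query: str, limit: int = 5) -> list[dict]:
    """Search long-term memory (Tier 2) for relevant knowledge."""
    return _search_knowledge_base(query, limit)


def _search_knowledge_base(query: str, limit: int) -> list[dict]:
    # Placeholder: the real implementation queries the Pixeltable knowledge table.
    return []


if __name__ == "__main__":
    mcp.run()  # serves the tools over stdio to MCP-compatible clients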

Key Design Principles

1. LLM-Agnostic via MCP

The system uses the Model Context Protocol (MCP), making it portable across:

  • Claude (via Claude Code)
  • OpenAI ChatGPT (MCP support added March 2025)
  • Google Gemini (MCP support announced April 2025)
  • Custom agents via programmatic access

2. Semantic Search over Keyword Matching

Long-term memory uses sentence-transformer embeddings for semantic similarity:

# Find conceptually related code, not just keyword matches
"authentication flow" → finds OAuth, JWT, session handling code

3. Multi-Project Awareness

Knowledge is indexed by service_name, enabling:

  • Cross-project searches ("How does auth-service handle tokens?")
  • Project-specific filtering ("Show incidents for payment-api only")
  • Organizational patterns ("What database patterns do we use?")

4. Automatic Embedding Maintenance

Pixeltable's computed columns automatically recompute embeddings when content changes:

# Embeddings stay in sync - no manual refresh needed
embedding=kb.content.apply(embed_text)
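
For context, a minimal end-to-end sketch of this pattern; the table name, model choice, and exact Pixeltable API details are illustrative and may differ by version:

import pixeltable as pt
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")  # local model, per the Tier 2 cost note


@pt.udf
def embed_text(content: str) -> list[float]:
    """Sentence embedding; Pixeltable recomputes it whenever content changes."""
    return _model.encode(content).tolist()


kb = pt.create_table("knowledge_base", {"content": pt.String, "service_name": pt.String})
kb.add_computed_column(embedding=embed_text(kb.content))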

5. Defense in Depth (Python Version)

Per ADR-001, the system includes 7 layers of protection against Python version mismatch issues that could corrupt the Pixeltable database.

Use Cases

1. Architectural Continuity

User: "Why did we choose PostgreSQL over MongoDB?"
AI: [Queries ADRs] "ADR-005 documents this decision from March 2024..."

2. Incident-Aware Development

User: "I'm adding rate limiting to the auth service"
AI: [Queries incidents] "Note: We had an outage in November due to rate limiter
misconfiguration. The post-mortem recommended..."

3. Cross-Session Context

Session 1: User sets task context "Implementing OAuth2 PKCE flow"
Session 2: AI recalls task context and relevant ADRs automatically

4. Codebase Navigation

User: "How do we handle database connections?"
AI: [Semantic search] "Based on auth-service/db/pool.py and the connection
pooling ADR, you use..."

Rationale

Why Not Just Use RAG?

Traditional RAG (Retrieval Augmented Generation) typically:

  • Requires manual chunk management
  • Needs explicit embedding refresh
  • Lacks structured metadata (service, type, severity)
  • Doesn't integrate with development workflow

Luminescent Cluster adds:

  • Computed columns: Auto-updating embeddings
  • Typed knowledge: Code vs ADR vs incident with different schemas
  • Git integration: Session memory tied to repository state
  • MCP exposure: Native tool integration with AI assistants

Why MCP over Custom APIs?

  • Standardized: Works with any MCP-compatible client
  • Discoverable: Tools are self-documenting
  • Composable: Clients can orchestrate multiple MCP servers
  • Future-proof: Now backed by Linux Foundation with industry-wide adoption

Why Pixeltable?

  • Computed columns: Embeddings auto-update on content change
  • Multimodal ready: Can extend to images, videos
  • Snapshot/restore: Built-in versioning for knowledge base
  • Python-native: Fits development workflow

Improvement Options (December 2025 Research)

Based on current industry developments, the following enhancement options are presented for council review:

Option A: HybridRAG - Knowledge Graph Integration

What: Combine vector embeddings with a knowledge graph for multi-hop reasoning.

Industry Evidence:

  • Microsoft Research shows 2.8x accuracy improvement with hybrid approaches
  • GraphRAG (Microsoft) constructs knowledge graphs from unstructured data
  • Cedars-Sinai's AlzKB demonstrates real-world HybridRAG success

Implementation:

┌─────────────────────────────────────────────────────────────────┐
│                      HybridRAG Architecture                     │
├────────────────────────┬────────────────────────────────────────┤
│     Vector Search      │            Graph Traversal             │
│      (Pixeltable)      │            (Neo4j/Memgraph)            │
├────────────────────────┼────────────────────────────────────────┤
│ • Semantic similarity  │ • Entity relationships                 │
│ • Fuzzy matching       │ • Multi-hop reasoning                  │
│ • Fast retrieval       │ • Causal chains                        │
└────────────────────────┴────────────────────────────────────────┘


Results from the two retrieval paths are merged with Reciprocal Rank Fusion, then re-ordered by a neural reranker.
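
Reciprocal Rank Fusion itself is compact; a standard sketch (k=60 is the conventional constant):

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists (e.g., vector search and graph traversal) into one ordering."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
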
Benefits:

  • Multi-hop reasoning: "What services depend on the auth module that had the incident?"
  • Explicit relationships: Code → ADR → Incident → Resolution chains
  • Better temporal reasoning: Track how decisions evolved

Tradeoffs:

  • Additional infrastructure (graph database)
  • Data modeling complexity
  • Sync between vector and graph stores

Effort: Medium-High. Impact: High for complex organizational queries.


Option B: Mem0 Integration

What: Integrate Mem0's production-proven memory layer alongside Pixeltable.

Industry Evidence:

  • 26% accuracy improvement, 91% lower p95 latency, 90% token savings
  • 41K GitHub stars, 14M downloads, 186M API calls/quarter
  • AWS chose Mem0 as exclusive memory provider for Agent SDK
  • SOC 2 & HIPAA compliant

Architecture:

┌─────────────────────────────────────────────────────────────────┐
│                          Memory Layer                           │
├────────────────────────┬────────────────────────────────────────┤
│       Pixeltable       │                  Mem0                  │
│     (Long-term KB)     │        (Conversational Memory)         │
├────────────────────────┼────────────────────────────────────────┤
│ • Codebase index       │ • User preferences                     │
│ • ADRs & incidents     │ • Conversation facts                   │
│ • Static knowledge     │ • Dynamic learning                     │
│ • Service-scoped       │ • User/session-scoped                  │
└────────────────────────┴────────────────────────────────────────┘

Benefits:

  • Production-proven scale and performance
  • Graph-based memory variant (Mem0g) for relational knowledge
  • Three-line integration
  • Enterprise compliance built-in

Tradeoffs:

  • External dependency (cloud or self-hosted)
  • Potential overlap with Pixeltable functionality
  • Additional cost for hosted version

Effort: Low-Medium. Impact: High for personalization and learning.


Option C: Hindsight Agentic Memory

What: Adopt the four-network memory architecture that achieved 91.4% on LongMemEval.

Industry Evidence:

  • Highest accuracy on LongMemEval benchmark (December 2025)
  • Designed specifically for agents needing temporal/causal reasoning
  • TEMPR retrieval: semantic + keyword + graph + temporal filtering

Four Network Architecture:

┌──────────────────────────────────────────────────────────────────┐
│                     Hindsight Memory Networks                    │
├─────────────────┬─────────────────┬─────────────────┬────────────┤
│      World      │      Bank       │     Opinion     │ Observation│
│     Network     │     Network     │     Network     │  Network   │
├─────────────────┼─────────────────┼─────────────────┼────────────┤
│ External facts  │ Agent's own     │ Subjective      │ Neutral    │
│ about the world │ experiences &   │ judgments with  │ entity     │
│                 │ actions         │ confidence      │ summaries  │
└─────────────────┴─────────────────┴─────────────────┴────────────┘

Mapped to Luminescent Cluster:

  • World Network: Codebase structure, API contracts, dependencies
  • Bank Network: What the AI has done, commits made, PRs reviewed
  • Opinion Network: Code quality assessments, architecture preferences
  • Observation Network: Service summaries, team conventions

Benefits:

  • Separates facts from opinions (prevents hallucination bleed)
  • Tracks agent's own actions (audit trail)
  • Temporal filtering for time-sensitive queries
  • Open source (Apache 2.0)

Tradeoffs:

  • Significant architectural change
  • New data model required
  • Less mature than Mem0 (newer project)

Effort: High. Impact: Very High for long-horizon agent tasks.


Option D: Context Engineering Enhancements

What: Implement advanced context management strategies.

Industry Evidence:

  • ACE framework: +10.6% on agents, +8.6% on finance benchmarks
  • Google ADK's Context Compaction reduces latency/tokens
  • Anthropic found isolated contexts outperform single-agent approaches

Strategies:

  1. Memory Blocks: Structure context into discrete functional units
┌─────────────────────────────────────────────────────────────────┐
│                       Memory Block Layout                       │
├─────────────────────────────────────────────────────────────────┤
│ [System Block]    │ Core instructions, persona                  │
│ [Project Block]   │ Current project context, conventions        │
│ [Task Block]      │ Active task, goals, constraints             │
│ [History Block]   │ Compressed conversation history             │
│ [Knowledge Block] │ Retrieved ADRs, incidents, code             │
└─────────────────────────────────────────────────────────────────┘
  2. Context Compaction: Auto-summarize when a token threshold is reached (see the sketch after this list)
  3. Selective Retrieval: Pull only relevant knowledge per query
  4. Context Isolation: Split complex tasks across subagents
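
A minimal compaction sketch; the token counter and summarize helper below are crude stand-ins for a real tokenizer and an LLM summarization call:

def count_tokens(text: str) -> int:
    return len(text.split())  # stand-in for a real tokenizer


def summarize(text: str) -> str:
    return text[:200]  # stand-in for an LLM summarization call


def maybe_compact(history: list[str], budget: int = 8000, keep_recent: int = 5) -> list[str]:
    """Collapse older turns into one summary block once the token budget is exceeded."""
    if sum(count_tokens(turn) for turn in history) <= budget:
        return history
    summary = summarize("\n".join(history[:-keep_recent]))
    return [f"[Compressed history] {summary}"] + history[-keep_recent:]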

Benefits:

  • Reduces "context rot" from oversized windows
  • Clear separation of concerns
  • Enables monitoring and debugging
  • Works with existing infrastructure

Tradeoffs:

  • Requires client-side coordination
  • MCP servers may not have visibility into full context
  • Summarization can lose nuance

Effort: Medium. Impact: Medium-High for efficiency.


Option E: Memory as a Service (MaaS)

What: Shift from agent-bound memory to shared memory services for multi-agent collaboration.

Industry Evidence:

  • Research shows agent memory silos hinder collaboration
  • MaaS paradigm emerging for multi-agent systems
  • Enables organizational memory shared across tools/agents

Architecture:

┌─────────────────────────────────────────────────────────────────┐
│                   Memory as a Service (MaaS)                    │
├─────────────────────────────────────────────────────────────────┤
│                       Shared Memory Layer                       │
│   ┌───────────┐      ┌───────────┐      ┌───────────┐           │
│   │  Code KB  │      │ Decision  │      │ Incident  │           │
│   │  Service  │      │  Service  │      │  Service  │           │
│   └─────┬─────┘      └─────┬─────┘      └─────┬─────┘           │
│         │                  │                  │                 │
│         ▼                  ▼                  ▼                 │
│   ┌───────────┐      ┌───────────┐      ┌───────────┐           │
│   │  Claude   │      │ GPT Agent │      │  Custom   │           │
│   │   Code    │      │           │      │ Pipeline  │           │
│   └───────────┘      └───────────┘      └───────────┘           │
└─────────────────────────────────────────────────────────────────┘

Benefits:

  • Memory persists regardless of which agent/LLM is used
  • Enables handoff between specialized agents
  • Organizational knowledge accessible to all tools
  • Natural fit for MCP's server architecture

Tradeoffs:

  • Access control complexity
  • Consistency across agents
  • Security considerations (MEXTRA attack vulnerabilities)

Effort: Medium. Impact: High for multi-agent workflows.


Option F: LangMem Integration

What: Integrate LangChain's LangMem SDK for procedural, episodic, and semantic memory types.

Industry Evidence:

  • Native integration with LangGraph (popular agent framework)
  • DeepLearning.AI course validates approach
  • MongoDB, Redis integrations available
  • Part of LangChain's production ecosystem

Memory Types:

┌─────────────────────────────────────────────────────────────────┐
│                       LangMem Memory Types                      │
├─────────────────┬─────────────────┬─────────────────────────────┤
│   Procedural    │    Episodic     │          Semantic           │
├─────────────────┼─────────────────┼─────────────────────────────┤
│ How to do tasks │ Specific events │ General facts               │
│ (coding styles, │ (this PR, that  │ (architecture,              │
│  conventions)   │  incident)      │  team knowledge)            │
└─────────────────┴─────────────────┴─────────────────────────────┘

Benefits:

  • Well-documented, production-ready
  • Multiple persistence backends
  • Checkpointing and time-travel
  • Large community support

Tradeoffs:

  • LangChain dependency
  • May not integrate directly with MCP
  • Overlaps with Pixeltable functionality

Effort: Medium. Impact: Medium for LangGraph users.


Options Comparison Matrix

Weighted Decision Matrix

| Criterion (Weight) | A: HybridRAG | B: Mem0 | C: Hindsight | D: Context Eng | E: MaaS | F: LangMem |
|---|---|---|---|---|---|---|
| Accuracy (25%) | ★★★★★ | ★★★☆☆ | ★★★★★ | ★★☆☆☆ | ★★★☆☆ | ★★★☆☆ |
| Dev Experience (30%) | ★★★☆☆ | ★★★★★ | ★★★☆☆ | ★★★★☆ | ★★★★☆ | ★★★★☆ |
| Effort/Risk (20%) | ★☆☆☆☆ | ★★★★☆ | ★☆☆☆☆ | ★★★★★ | ★★★☆☆ | ★★★★☆ |
| Maturity (15%) | ★★★☆☆ | ★★★★☆ | ★★☆☆☆ | ★★★★★ | ★★★☆☆ | ★★★★☆ |
| Flexibility (10%) | ★★★★☆ | ★★★☆☆ | ★★★☆☆ | ★★★★★ | ★★★★☆ | ★★★★☆ |
| Weighted Score | 2.95 | 3.70 | 2.80 | 3.80 | 3.35 | 3.60 |

Quantitative Metrics (Hypotheses)

| Option | Accuracy Gain | Latency Impact | Effort | Infrastructure | Best For |
|---|---|---|---|---|---|
| A. HybridRAG | +180% (2.8x)* | +50ms | High | Graph DB | Multi-hop code queries |
| B. Mem0 | +26% | -91% p95 | Low-Med | Cloud/Self-host | User personalization |
| C. Hindsight | +40%** | Neutral | High | New data model | Long-horizon agents |
| D. Context Eng | +10-30% | -30% | Medium | None | Immediate optimization |
| E. MaaS | N/A | Neutral | Medium | API layer | Multi-agent workflows |
| F. LangMem | Moderate | Neutral | Medium | LangChain | LangGraph integration |

*Based on Microsoft GraphRAG research (2024); requires validation on code domains
**Estimated from LongMemEval 91.4% vs ~65% baseline RAG


Additional Industry Developments (Council Additions)

Based on council feedback, the following December 2025 developments were identified as missing:

Option G: Context Caching (Provider-Side)

What: Leverage provider-side context caching (Anthropic, OpenAI, Google) to avoid re-sending stable project context.

Industry Evidence:

  • By late 2025, all major providers offer context caching
  • Cost reduction: 75-90% for repeated project context
  • Latency reduction: Eliminate round-trip for cached prefixes

Implementation:

┌─────────────────────────────────────────────────────────────────┐
│                      Context Caching Flow                       │
├─────────────────────────────────────────────────────────────────┤
│ [Stable Context]         │ [Dynamic Context]                    │
│ • Project architecture   │ • Current task                       │
│ • ADRs & conventions     │ • Recent code changes                │
│ • Team patterns          │ • User query                         │
│          ↓               │          ↓                           │
│ CACHED (TTL: 1 hour)     │ SENT FRESH                           │
└─────────────────────────────────────────────────────────────────┘
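
As an illustration, with Anthropic's prompt caching the stable prefix is marked with cache_control; the model ID, file path, and query below are placeholders:

import anthropic

stable_project_context = open("docs/project_context.md").read()  # illustrative path
user_query = "How do we handle database connections?"

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": stable_project_context,          # stable prefix (ADRs, conventions)
            "cache_control": {"type": "ephemeral"},  # cached between calls
        }
    ],
    messages=[{"role": "user", "content": user_query}],  # sent fresh each turn
)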

Effort: Low. Impact: High for cost/latency.


Option H: Episodic Rollback ("Git for Memory")

What: Version control for memory state, enabling rollback when agents go wrong.

Industry Evidence:

  • Developer expectation from git workflows
  • Critical for debugging agent mistakes
  • Enables "what if" exploration

Implementation:

# Memory checkpoint API
checkpoint = memory.create_checkpoint("before-refactor")
# ... agent makes decisions ...
memory.rollback_to(checkpoint) # Undo bad decisions
memory.diff(checkpoint, "HEAD") # Compare states

Effort: Medium. Impact: High for agent reliability.


Option I: Provenance & Governance

What: Every memory item carries source links, timestamps, confidence, and validity scope.

Industry Evidence:

  • Enterprise requirement for audit trails
  • Prevents hallucination from unverified sources
  • Enables "why does the AI think this?" debugging

Schema:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class MemoryItem:
    content: str
    source: str            # "PR #402", "ADR-005", "user:chris"
    created_at: datetime
    expires_at: datetime   # TTL-based expiration
    confidence: float      # 0.0-1.0
    verified_by: str       # Human approval if required

Effort: Medium. Impact: High for enterprise trust.


Option J: Retrieval-Augmented Thoughts (RAT)

What: Interleave retrieval during chain-of-thought, not just before generation.

Industry Evidence:

  • Research shows improved reasoning with mid-thought retrieval
  • Addresses "lost in the middle" problem
  • More accurate for complex technical queries

Flow:

Traditional RAG:  [Retrieve] → [Think] → [Respond]
RAT:              [Think] → [Retrieve] → [Think more] → [Retrieve] → [Respond]

Effort: High (requires reasoning model integration). Impact: High for complex queries.


Option K: Temporal Memory Decay (Forgetting Curve)

What: Implement forgetting mechanisms so old/unused memories decay in relevance.

Industry Evidence:

  • "Relevance pollution" is a real problem at scale
  • Old decisions may conflict with new ones
  • Memory systems need pruning strategies

Implementation:

import math

# Decay function: relevance decays exponentially since last access.
def decayed_relevance(base_relevance: float, days_since_access: float, lam: float) -> float:
    return base_relevance * math.exp(-lam * days_since_access)

# Re-access refreshes relevance (days_since_access resets to 0).
# Low-relevance items are deprioritized in retrieval, not deleted.

Effort: Low-Medium. Impact: Medium for long-term maintenance.

Revised Implementation Roadmap

Critical Change: The council unanimously agreed the original roadmap was "Optimization before Foundation." The revised roadmap prioritizes quick wins and foundations before heavy architectural lifts.

Phase 0: Foundations (Required First)

Purpose: Cannot optimize what you cannot measure

Deliverables:

  1. Evaluation Harness

    • Fixed task set + automated scoring
    • Retrieval quality metrics (precision@k, citation correctness); see the scorer sketch after this list
    • Latency/cost instrumentation
    • Contradiction/hallucination tests
  2. Memory Schema & Lifecycle

    • Define memory types (decisions, facts, procedures, preferences)
    • TTL and expiration policies
    • Versioning strategy (Option H foundation)
    • "Source of truth" policy
  3. Governance & Observability

    • Trace: retrieval → context assembly → model output
    • Log which memories were used and why
    • Audit trail for memory changes
    • Access control framework
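
The precision@k scorer at the heart of the harness can be as small as this sketch:

def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved items that are actually relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)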

Exit Criteria: Baseline metrics established, governance policies documented


Phase 1: Conversational Memory (Pixeltable Native) - 8 WEEKS

Decision: Build on Pixeltable (not Mem0 integration - see Architecture Decision above)

Purpose: Unified memory store with extraction capabilities

Critical Architecture: The "Janitor" Pattern (Council Mandated)

Running LLM extraction synchronously kills latency. We adopt a tiered approach:

┌─────────────────────────────────────────────────────────────────────────────┐
│                    HOT / WARM / COLD MEMORY ARCHITECTURE                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  TIER 1: HOT MEMORY (Real-time)                                             │
│  ──────────────────────────────                                             │
│  • Raw chat history retrieval                                               │
│  • No extraction cost                                                       │
│  • Latency: <50ms                                                           │
│                                                                             │
│  TIER 2: WARM MEMORY (Async Extraction)                                     │
│  ──────────────────────────────────────                                     │
│  • Pixeltable computed columns with SMALL model (Llama-3-8B, Haiku)         │
│  • Extraction runs AFTER response sent to user                              │
│  • Latency: Background (no user impact)                                     │
│                                                                             │
│  TIER 3: COLD MEMORY (Scheduled Consolidation)                              │
│  ─────────────────────────────────────────────                              │
│  • "Janitor Process" - nightly batch job                                    │
│  • Uses REASONING model (GPT-4o, Opus) for complex dedup                    │
│  • Merges facts, resolves contradictions, expires old data                  │
│  • Latency: Nightly (no user impact)                                        │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
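
A minimal sketch of the warm tier's extract-after-respond pattern; generate_reply and the extraction body are hypothetical stand-ins:

import asyncio


async def generate_reply(user_msg: str) -> str:
    return f"echo: {user_msg}"  # stand-in for the real chat model call


async def extract_and_store(user_msg: str, reply: str) -> None:
    await asyncio.sleep(0)  # stand-in for small-model extraction + Pixeltable insert


async def handle_chat_turn(user_msg: str) -> str:
    reply = await generate_reply(user_msg)                   # hot path: no extraction cost
    asyncio.create_task(extract_and_store(user_msg, reply))  # warm tier: runs after response
    return reply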

Deliverables (Extended to 8 Weeks per Council):

  1. Phase 1a: Storage & Hot Memory (Weeks 1-2)

    • Add user_memory table to Pixeltable schema
    • Add conversation_memory table with TTL support
    • Implement user/project scoping
    • Basic CRUD through MCP tools
    • Raw retrieval (Hot Memory tier)
  2. Phase 1b: Async Extraction (Weeks 3-4)

    • Create extract_memory_facts() UDF using small model (Haiku/Llama-3-8B)
    • Async execution: Extract AFTER response sent
    • Confidence scoring and thresholds
    • Extraction determinism: temperature=0
    • Store raw source alongside extracted facts for re-processing
  3. Phase 1c: Retrieval & Ranking (Weeks 5-6)

    • Ranking logic: Hot facts vs Extracted facts
    • Query rewriting for memory search
    • Scope-aware retrieval (user vs project vs global)
  4. Phase 1d: Janitor Process (Weeks 7-8)

    • Nightly consolidation job using reasoning model (GPT-4o)
    • Basic deduplication (>85% similarity threshold; see the sketch after this list)
    • Simple contradiction handling: newer wins, flag for review
    • Memory decay: reduce relevance of unaccessed items
    • Defer complex contradiction resolution to Phase 2
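
The Phase 1d similarity check can be a plain cosine comparison over stored embeddings, as in this sketch:

import numpy as np


def is_duplicate(a: np.ndarray, b: np.ndarray, threshold: float = 0.85) -> bool:
    """Cosine similarity against the Phase 1d deduplication threshold."""
    cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cosine >= threshold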

Schema Definition:

# Pixeltable tables for conversational memory
user_memory = pt.Table(
    "user_memory",
    columns={
        "user_id": pt.String,
        "content": pt.String,
        "memory_type": pt.String,          # preference, fact, decision
        "confidence": pt.Float,
        "source": pt.String,               # conversation_id, manual
        "raw_source": pt.String,           # Original text for re-extraction
        "extraction_version": pt.Int,      # For re-processing on prompt updates
        "created_at": pt.Timestamp,
        "last_accessed_at": pt.Timestamp,  # For decay scoring
        "expires_at": pt.Timestamp,        # TTL support
    },
    computed_columns={
        "embedding": embed_text(content),
        "scope": memory_scope(user_id, project_id),
    },
)

Validation Metrics (Council Required):

| Metric | Target | Measurement |
|---|---|---|
| Hot memory latency | <50ms p95 | Instrumentation |
| Extraction precision | >85% | Golden Dataset eval |
| Retrieval relevance | >90% rated "helpful" | Blind evaluation |
| Storage efficiency | <10% duplicates post-consolidation | Automated check |
| Query latency | <200ms p95 | Instrumentation |

Golden Dataset (Council Required): Create 50 static questions representing real project queries:

  • "What database do we use?"
  • "Why did we reject MongoDB?"
  • "What's our logging convention?"
  • "Who approved the Kafka decision?"

Run regression tests against Golden Dataset on every PR.
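
A hedged sketch of such a regression test; expected_in_results is a hypothetical helper, and the dataset path matches the Implementation Tracker below:

import json


def expected_in_results(question: str, expected: str) -> bool:
    # Placeholder: run semantic_search(question) and check that expected appears.
    ...


def test_golden_dataset_accuracy():
    with open("tests/memory/golden_dataset.json") as f:
        cases = json.load(f)
    hits = sum(1 for case in cases if expected_in_results(case["question"], case["expected"]))
    assert hits / len(cases) >= 0.85  # Phase 1 exit criterion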

Exit Criteria:

  • 85% accuracy on Golden Dataset
  • Zero cross-user memory leakage
  • Janitor process completes in <10 minutes for 10k memories

Decision Reversal Trigger: If Week 4 checkpoint shows extraction precision <70%, evaluate:

  1. Scoping down consolidation features
  2. Revisiting Mem0 for extraction-only (not storage)

Phase 2: Context Engineering Optimization

Options: D (Context Engineering) + I (Provenance) + K (Temporal Decay)

Purpose: Optimize retrieval and context assembly

Deliverables:

  1. Memory Blocks Architecture

    [System Block]    │ Core instructions, persona
    [Project Block]   │ Current project context, conventions
    [Task Block]      │ Active task, goals, constraints
    [History Block]   │ Compressed conversation history
    [Knowledge Block] │ Retrieved ADRs, incidents, code
  2. Retrieval Improvements

    • Query rewriting for better matches
    • Reranking layer for relevance
    • Selective injection based on query type
    • Contextual compression for long memories
  3. Provenance Tracking

    • Source links for all memories
    • Confidence scoring
    • Temporal decay implementation

Exit Criteria:

  • 30% token efficiency improvement
  • Provenance available for all retrieved items
  • Stale memory detection operational

Phase 3: Knowledge Structure (HybridRAG)

Options: A (HybridRAG with Knowledge Graph)

Purpose: Multi-hop reasoning for complex code queries

Prerequisites: Stable data from Phase 1-2, validated need for graph queries

Deliverables:

  • Knowledge graph for code dependencies
  • Entity extraction pipeline (async, not blocking)
  • Reciprocal Rank Fusion (vector + graph)
  • Neural reranker for final results

Target Queries (justify the investment):

"What services depend on auth that had incidents last month?"
"Show me all code that changed because of ADR-005"
"Which team owns the service that calls the failing endpoint?"

Exit Criteria:

  • Multi-hop queries outperform pure vector by >50%
  • Latency <1s for graph-augmented queries
  • Entity extraction runs async (no chat blocking)

Phase 4: Advanced Capabilities

Options: C (Hindsight), E (MaaS), J (RAT)

Purpose: Future-state capabilities based on validated needs

Conditional Entry:

  • Hindsight (C): Only if Phase 3 reveals need for temporal/causal reasoning
  • MaaS (E): Only if multi-agent architecture is adopted
  • RAT (J): Only if complex reasoning queries dominate usage

Approach: Cherry-pick techniques from Hindsight rather than wholesale adoption


Roadmap Visualization

┌─────────────────────────────────────────────────────────────────────────────┐
│                               ROADMAP TIMELINE                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Phase 0         Phase 1         Phase 2         Phase 3        Phase 4     │
│  Foundations     Memory          Context Eng     HybridRAG      Advanced    │
│  ┌────────┐      ┌────────┐      ┌────────┐      ┌────────┐     ┌─────────┐ │
│  │ Eval   │      │ Memory │      │ Blocks │      │ Graph  │     │Hindsight│ │
│  │ Schema │─────▶│ Policy │─────▶│ Rerank │─────▶│ Vector │────▶│ MaaS    │ │
│  │ Govern │      │ Cache  │      │ Decay  │      │ Fusion │     │ RAT     │ │
│  └────────┘      └────────┘      └────────┘      └────────┘     └─────────┘ │
│                                                                             │
│  [Must Have]     [Quick Win]     [Optimization]  [Conditional]  [Future]    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Tier Constraints (ADR-004): Memory persistence depth and retention policies vary by monetization tier. Free tier has local-only storage; Team tier includes cloud persistence with GDPR auto-deletion; Enterprise tier supports configurable retention with legal hold. See ADR-004 for tier definitions and ADR-007 for implementation details.


Implementation Tracker

| Phase | Component | Status | Notes |
|---|---|---|---|
| Phase 0 | Evaluation Harness | ✅ Complete | src/memory/evaluation/ - 28 tests |
| Phase 0 | Golden Dataset | ✅ Complete | tests/memory/golden_dataset.json - 50 questions |
| Phase 0 | Memory Schema & Lifecycle | ✅ Complete | src/memory/schemas/, src/memory/lifecycle/ - 36 tests |
| Phase 0 | Governance & Observability | ✅ Complete | src/memory/observability/ - 27 tests |
| Phase 0 | MemoryProvider Protocol | ✅ Complete | src/memory/protocols.py - ADR-007 compliant |
| Phase 1 | Session Memory MCP | ✅ Complete | session_memory_server.py - 45 tests |
| Phase 1 | Pixeltable MCP | ✅ Complete | pixeltable_mcp_server.py - 62 tests |
| Phase 1a | Hot Memory Storage | ✅ Complete | src/memory/storage/, src/memory/providers/local.py - 27 tests |
| Phase 1a | Memory MCP Tools | ✅ Complete | src/memory/mcp/tools.py - 21 tests |
| Phase 1a | Hot Memory Latency | ✅ Complete | <50ms p95 target met - 6 tests |
| Phase 1b | Async Extraction | ✅ Complete | src/memory/extraction/ - 36 tests |
| Phase 1b | Extraction Precision | ✅ Complete | >85% target met (100% achieved) - 5 tests |
| Phase 1c | Retrieval & Ranking | ✅ Complete | src/memory/retrieval/ - 31 tests |
| Phase 1c | Query Rewriting | ✅ Complete | Synonym expansion for better recall |
| Phase 1c | Scope-Aware Retrieval | ✅ Complete | user > project > global hierarchy |
| Phase 1c | Retrieval Latency | ✅ Complete | <200ms p95 target met - 6 tests |
| Phase 1d | Janitor Process | ✅ Complete | src/memory/janitor/ - 24 tests |
| Phase 1d | Deduplication | ✅ Complete | >85% similarity threshold |
| Phase 1d | Contradiction Handling | ✅ Complete | "newer wins" with flagging |
| Phase 1d | Expiration Cleanup | ✅ Complete | TTL-based expiration |
| Phase 1d | Janitor Performance | ✅ Complete | <10 min for 10k target met - 5 tests |
| CI/Security | Memory Isolation | ✅ Complete | Zero cross-user leakage - 8 tests |
| CI/Security | Protocol Compliance | ✅ Complete | Three-layer testing - 23 tests |
| CI/Security | CI Workflow | ✅ Complete | .github/workflows/memory-evaluation.yml |
| Phase 2 | Memory Blocks Architecture | 📝 Not Started | Context Engineering |
| Phase 2 | Provenance Tracking | 🔄 Partial | Via AuditLogger protocol (ADR-007) |
| Phase 2 | Temporal Decay | ✅ Complete | Integrated in retrieval ranking |
| Phase 3 | Knowledge Graph | 📝 Not Started | HybridRAG |
| Phase 3 | Entity Extraction | 📝 Not Started | Async pipeline |
| Phase 4 | Hindsight Integration | 📝 Not Started | Conditional on Phase 3 |
| Phase 4 | MaaS Architecture | 📝 Not Started | Multi-agent support |

Test Summary: 304 tests passing (as of 2025-12-28)

Legend: ✅ Complete | 🔄 Partial | 📝 Not Started | ❌ Blocked


Consequences

Positive

  • AI assistants maintain context across sessions
  • Reduced repetitive context loading (estimated 60-80% reduction)
  • Institutional knowledge becomes searchable
  • Consistent AI suggestions aligned with past decisions
  • Multi-project organizational awareness
  • Foundation for advanced multi-agent workflows

Negative

  • Requires initial setup and ingestion
  • Python version binding (mitigated by ADR-001)
  • Storage growth with codebase size
  • Embedding computation cost (mitigated by local models)
  • New infrastructure to maintain (Pixeltable, MCP servers)
  • Team learning curve on memory-augmented AI patterns

Neutral

  • Requires MCP-compatible client
  • Team adoption needed for full benefit
  • ADR/incident discipline improves value
  • Commits us to MCP protocol (reasonable bet given industry adoption)
  • May require revisiting if context windows grow significantly larger

Risks and Mitigations (Council Additions)

Technical Risks

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Memory Poisoning | Medium | High | Provenance tracking, confidence scoring, human approval gates for "decision-grade" memories |
| Staleness & Drift | High | High | Link memories to repo commits, TTL policies, periodic revalidation jobs, git hook triggers |
| Contradictions | Medium | Medium | Conflict resolution policy (prefer newer? prefer docs?), show both with citations |
| Latency Blowups | Medium | High | Budgets per tier, caching, "fast path" vs "deep reasoning path" separation |
| Relevance Pollution | High | Medium | Temporal decay (Option K), ranking algorithms, "forgetting curve" implementation |
| Schema Drift | High | Medium | Async re-indexing on commit, version migrations, schema evolution plan |

Organizational Risks

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Scope Creep via Options | High | High | Hard constraint: pick ≤3 options for 2025, phase gates required |
| Benchmark Gaming | Low | Medium | Include qualitative user feedback, test against real project history |
| Vendor Lock-in | Medium | Medium | Define internal memory APIs and adapters, avoid proprietary-only features |
| Operational Complexity | Medium | Medium | Document backup/migration procedures, plan embedding model upgrades |

Security & Privacy Risks

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Memory Leakage | Medium | High | Strict namespace isolation, MCP scoping, audit logs for cross-project retrievals |
| MEXTRA Attacks | Low | High | Output filtering, memory de-identification, input validation |
| Secret Exposure | Low | Critical | Secret redaction in ingestion, .gitignore-style exclusion patterns |
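
As an illustration of the secret-redaction mitigation, an ingestion-time filter might look like this sketch; the patterns are examples, not a vetted ruleset:

import re

# Example patterns only; a production filter would use a vetted secret-scanning ruleset.
SECRET_PATTERNS = [
    re.compile(r"(?i)(api[_-]?key|secret|token|password)\s*[:=]\s*\S+"),
    re.compile(r"AKIA[0-9A-Z]{16}"),  # shape of an AWS access key ID
]


def redact_secrets(text: str) -> str:
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text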

Risk Assessment by Council Model

Gemini: "Memory update triggers (Git hooks) are required to keep memory synchronized. Without them, the Knowledge Graph becomes a hallucination source."

Claude: "Wrong context is worse than no context. Memory poisoning needs confidence scoring and human approval gates."

Grok: "Ensure memory systems don't perpetuate biases in persistent context via biased knowledge graphs."

GPT: "Memory systems need pruning strategies - relevance pollution is a real problem at scale."

Security Considerations

Per recent research (MEXTRA attack, February 2025), memory systems are vulnerable to:

  • Prompt injection extracting stored memories
  • Cross-session data leakage

Mitigations (to implement):

  • User/session-level isolation
  • Memory de-identification
  • Output filtering
  • Audit logging


Open Questions

The following questions have been investigated and resolved:

  1. Single-tenant vs Multi-tenant: ✅ Resolved - Multi-tenancy is supported via the TenantProvider protocol (ADR-005) and CloudTenantProvider implementation (ADR-007). See ADR-007 Section 1a for tenant isolation enforcement points.

  2. Memory Approval Workflows: ✅ Resolved - The AuditLogger protocol (ADR-007) provides provenance tracking. GDPR compliance workflows in GDPRService (luminescent-cloud) implement approval gates for data deletion.

  3. Retention Policies: ✅ Resolved - ADR-007 Section 4 defines tier-specific retention:

    • Free (OSS): User's responsibility (no auto-deletion)
    • Team (Cloud): Auto-delete on workspace exit (GDPR Article 17)
    • Enterprise: Configurable (legal hold support)
  4. Embedding Model Upgrades: 🔄 Partially Resolved - Pixeltable computed columns support re-indexing. Full migration strategy deferred to Phase 3 (HybridRAG).

  5. Conflict Resolution: 🔄 Partially Resolved - Phase 1d Janitor Process implements "newer wins" with flagging for review. Complex contradiction handling deferred to Phase 2 (Context Engineering).

Council Review Summary

Review Date: 2025-12-22
Council Configuration: High confidence (all 4 models)
Models: Gemini-3-Pro, Claude Opus 4.5, Grok-4, GPT-5.2-Pro

Unanimous Recommendations (All 4 Models)

  1. Add Phase 0 for foundations, evaluation, and governance
  2. Reorder roadmap: Mem0 before HybridRAG
  3. Add quantified success metrics
  4. Add provenance/governance as core requirement
  5. Include context caching for cost optimization

Key Insights by Model

  • Gemini: "Optimization before Foundation" is the core flaw
  • Claude: "Wrong context is worse than no context"
  • Grok: Emphasized bias prevention in knowledge graphs
  • GPT: Highlighted need for "Phase 0" evaluation harness

Related Decisions

  • ADR-001: Python Version Requirement (database integrity)
  • ADR-002: Workflow Integration (automated ingestion)
  • ADR-004: Monetization Strategy (tier-based memory constraints)
  • ADR-005: Repository Organization Strategy (OSS vs Paid separation, Extension Registry)
  • ADR-006: Chatbot Platform Integrations (adapters implementing memory I/O)
  • ADR-007: Cross-ADR Integration Guide (Phase Alignment Matrix, protocol consolidation)

Changelog

| Version | Date | Changes |
|---|---|---|
| 1.0 | 2025-12-22 | Initial draft |
| 2.0 | 2025-12-22 | Added industry research and 6 improvement options |
| 3.0 | 2025-12-22 | Council Review: Revised roadmap (Phase 0 + reordering), added decision drivers, non-goals, success metrics table, interface contract, 5 additional options (G-K), comprehensive risk matrix, council feedback summary |
| 4.0 | 2025-12-22 | Council Review #2: Rejected Mem0 integration in favor of Pixeltable-native memory. Added "Build vs Integrate" decision section with code examples. Updated Phase 1 to Pixeltable Native approach. Added reference to ADR-004 (Monetization). |
| 4.1 | 2025-12-23 | Council Validation: Extended Phase 1 to 8 weeks. Added "Janitor" (Hot/Warm/Cold) architecture for async extraction. Added Golden Dataset requirement. Added validation metrics and decision reversal triggers. |
| 4.2 | 2025-12-28 | Cross-ADR Synchronization: Added ADR-006 and ADR-007 to Related Decisions. Updated Non-Goals to reflect multi-tenancy via Extension Registry. Added Interface Contract admonition linking to consolidated protocols. Marked Open Questions as resolved with references. Added Implementation Tracker section. Added tier constraints note to Roadmap. Council review (Grok-4, GPT-4o). |