ADR-011: Redis Single Instance Strategy
Status
Implemented
Date
2025-01-16 (Retrospective)
Decision Makers
- Architecture Team - Caching architecture
- SRE Team - Operations requirements
Layer
Caching
Related ADRs
- ADR-012: Token Bucket Rate Limiting
- ADR-013: Graceful Redis Degradation
- ADR-014: Entity Counts Batching
- ADR-033: Session Management
Supersedes
None
Depends On
- ADR-010: Docker Compose Development
Context
The SRE Operations Platform needs caching for:
- Performance: Reduce database load for frequent queries
- Rate Limiting: Protect API from abuse
- Session Storage: User session management
- Distributed Locking: Coordinate multi-instance operations
- Background Tasks: Queue and progress tracking
Key constraints:
- Must support all caching use cases in single deployment
- Need graceful degradation when unavailable
- Require logical separation of concerns
- Must work in development and production
- Keep operational complexity low
Decision
We adopt a single Redis instance with database separation strategy:
Key Design Decisions
- Single Instance: One Redis server handles all use cases
- Database Separation: Different Redis databases (0-7) for isolation
- Graceful Degradation: System works without Redis
- AOF Persistence: Append-only file for durability
- Connection Pooling: Efficient connection management
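The graceful-degradation decision (detailed in ADR-013) can be sketched as a thin wrapper that serves reads from the underlying data source whenever Redis is unreachable. This is an illustrative sketch, not the actual `cache_manager.py` implementation; the class and method names are hypothetical, and a real implementation would catch `redis.exceptions.ConnectionError` rather than the builtin.

```python
# Hypothetical sketch of graceful degradation (not the project's cache_manager):
# reads fall back to the loader, and writes are best-effort, when Redis is down.
class DegradingCache:
    def __init__(self, client):
        # `client` is any object exposing get/setex, e.g. a redis.Redis instance
        self.client = client

    def get_or_load(self, key, loader, ttl=300):
        try:
            cached = self.client.get(key)
            if cached is not None:
                return cached
        except ConnectionError:
            return loader()  # Redis unavailable: serve directly from the source
        value = loader()
        try:
            self.client.setex(key, ttl, value)  # best-effort cache fill
        except ConnectionError:
            pass  # degradation is silent; the caller still gets its value
        return value
```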
Database Allocation
| DB | Purpose | TTL Strategy |
|---|---|---|
| 0 | Entity Cache | 5 min default |
| 1 | Query Cache | 2-5 min |
| 2 | Rate Limiting | Sliding window |
| 3 | Sessions | 8-24 hours |
| 4 | Distributed Locks | 30 sec - 5 min |
| 5 | Background Tasks | Task duration |
| 6 | AI Cache | 1 hour |
| 7 | Reserved | Future use |
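With redis-py, a logical database is selected via the numeric suffix of the connection URL. A minimal helper mapping the allocation table above to URLs might look like the following; the function name, dictionary keys, and base URL are assumptions for illustration, not the project's actual API.

```python
# Hypothetical helper: build a connection URL for a logical database
# from the allocation table above. Purpose names are illustrative.
DATABASES = {
    "entity_cache": 0,
    "query_cache": 1,
    "rate_limiting": 2,
    "sessions": 3,
    "locks": 4,
    "background_tasks": 5,
    "ai_cache": 6,
}

def redis_url_for(purpose, base="redis://localhost:6379"):
    return f"{base}/{DATABASES[purpose]}"
```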
Configuration
```python
# Redis configuration
REDIS_URL = "redis://localhost:6379/0"
REDIS_PASSWORD = None  # Optional
REDIS_SOCKET_TIMEOUT = 5
REDIS_CONNECTION_POOL_SIZE = 10
```
Docker Compose Service
```yaml
redis:
  image: redis:7-alpine
  command: redis-server --appendonly yes
  ports: ["6379:6379"]
  volumes: [redis_data:/data]
  healthcheck:
    test: ["CMD", "redis-cli", "ping"]
```
Consequences
Positive
- Simplicity: One service to manage
- Cost Effective: No need for Redis Cluster
- Logical Isolation: Database separation provides clear boundaries
- Persistence: AOF ensures data survives restarts
- Low Latency: In-memory with sub-millisecond access
Negative
- Single Point of Failure: A Redis outage affects all caching (mitigated by graceful degradation)
- Memory Constraints: All data in single instance memory
- No True Isolation: Databases share resources
- Scaling Limit: Eventually may need cluster
Neutral
- Monitoring: Single instance is easier to monitor
- Backup: Simple backup/restore process
Alternatives Considered
1. Redis Cluster
- Approach: Sharded Redis with automatic failover
- Rejected: Over-engineered for current scale, operational complexity
2. Multiple Redis Instances
- Approach: Separate instances per use case
- Rejected: More resources, more complexity
3. Memcached + Redis
- Approach: Memcached for cache, Redis for features
- Rejected: Two systems to manage, feature overlap
Implementation Status
- Core implementation complete
- Tests written and passing
- Documentation updated
- Migration/upgrade path defined
- Monitoring/observability in place
Implementation Details
- Cache Manager: backend/core/caching/cache_manager.py
- Redis Setup: backend/core/redis_manager.py
- Docker Config: docker-compose.yml
- Docs: backend/docs/architecture/redis-implementation-complete.md
Compliance/Validation
- Automated checks: Health check endpoint for Redis
- Manual review: Memory usage reviewed monthly
- Metrics: Hit rate, miss rate, memory usage via Prometheus
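The hit and miss counts come from the `stats` section of Redis's `INFO` command (`keyspace_hits` / `keyspace_misses` are the actual field names; redis-py's `client.info("stats")` returns them in a dict). A sketch of the derived hit-rate gauge, with the parsing into a dict assumed:

```python
# Sketch: derive the cache hit rate from INFO stats counters.
def hit_rate(stats):
    hits = stats.get("keyspace_hits", 0)
    misses = stats.get("keyspace_misses", 0)
    total = hits + misses
    return hits / total if total else 0.0
```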
LLM Council Review
Review Date: 2025-01-16
Confidence Level: High (100%)
Verdict: APPROVED WITH MODIFICATIONS
Quality Metrics
- Consensus Strength Score (CSS): 1.0
- Deliberation Depth Index (DDI): 0.92
Council Feedback Summary
The council identified a critical architectural flaw: the multi-database strategy (DB 0-7) is incompatible with future scaling to Redis Cluster, which only supports DB 0.
Key Concerns Identified:
- Cluster Incompatibility: Redis Cluster doesn't support SELECT (DB 1-7)
- No Resource Isolation: All DBs share CPU, memory, and eviction policy
- Connection Pool Too Low: 10 connections may bottleneck with 100 concurrent users
Required Modifications:
- Replace DB Indexes with Key Namespacing:
  - entity:{type}:{id}
  - cache:query:{hash}
  - ratelimit:{id}
  - session:{id}
  - lock:{resource}
  - ai:{hash}
- Increase Connection Pool: From 10 to 50 connections
- Plan HA Before Cluster: Redis Sentinel for high availability first
- Add Flush-Safety Classification: Document which key prefixes are safe to flush
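The namespacing modification can be sketched as a small key builder that keeps every key in DB 0 with an explicit prefix; the helper name is hypothetical, but the prefixes match the list above.

```python
# Hypothetical key builder: replaces SELECT-based DB separation with
# prefix namespaces, which survive a move to Redis Cluster (DB 0 only).
PREFIXES = {"entity", "cache", "ratelimit", "session", "lock", "ai"}

def make_key(prefix, *parts):
    if prefix not in PREFIXES:
        raise ValueError(f"unknown namespace: {prefix}")
    return ":".join((prefix, *map(str, parts)))
```

A rejected prefix fails fast, which also supports the flush-safety classification: tooling can enumerate exactly which namespaces exist before flushing.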
Modifications Applied
- Updated key naming conventions to use prefixes
- Documented scaling path: Single → Sentinel (HA) → Cluster (sharding)
- Added memory limits and maxmemory-policy configuration
- Documented flush-safety classification per key prefix
Council Ranking
- gpt-5.2: 0.889 (Best Response - identified Cluster incompatibility)
- gemini-3-pro: 0.5
- claude-opus-4.5: 0.5
- grok-4.1: 0.0
Operational Guidelines (APPROVED_WITH_MODS)
Redis Sentinel Upgrade Path
Current State (Single Instance):
```
┌─────────────────────┐
│ Redis (Primary)     │
│ localhost:6379      │
│ DB 0: cache         │
│ DB 1: sessions      │
│ DB 2: rate_limit    │
└─────────────────────┘
```
Phase 1: Redis Sentinel (High Availability):
```
┌─────────────────────┐     ┌─────────────────────┐
│   Redis Primary     │────▶│   Redis Replica     │
│   redis-1:6379      │     │   redis-2:6379      │
└─────────────────────┘     └─────────────────────┘
           │                           │
           ▼                           ▼
┌─────────────────────────────────────────────────┐
│           Sentinel Cluster (3 nodes)            │
│ sentinel-1:26379  sentinel-2:26379  sentinel-3  │
└─────────────────────────────────────────────────┘
```
Sentinel Configuration:
```
# sentinel.conf
sentinel monitor ops-master redis-1 6379 2
sentinel down-after-milliseconds ops-master 5000
sentinel failover-timeout ops-master 60000
sentinel parallel-syncs ops-master 1
```
Application Configuration for Sentinel:
```python
# backend/core/redis/config.py
import redis
from redis.sentinel import Sentinel

def get_redis_client():
    if settings.redis_sentinel_enabled:
        sentinel = Sentinel(
            [('sentinel-1', 26379), ('sentinel-2', 26379), ('sentinel-3', 26379)],
            socket_timeout=0.5,
        )
        return sentinel.master_for('ops-master', socket_timeout=0.5)
    return redis.Redis.from_url(settings.redis_url)
```
Phase 2: Redis Cluster (Sharding) - Future:
- Requires key namespacing migration (no DB selection)
- Use hash tags for multi-key operations: {user:123}:session, {user:123}:prefs
- Minimum 6 nodes (3 masters, 3 replicas)
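Hash tags work because Redis Cluster computes a key's slot from only the substring between the first `{` and `}` (when non-empty), so keys sharing a tag land on the same node. The cluster spec defines the slot as CRC16 (XMODEM variant) of that substring modulo 16384; the sketch below implements it directly.

```python
# Sketch of Redis Cluster's slot computation: CRC16 (XMODEM variant,
# poly 0x1021, init 0x0000) over the hash tag, modulo 16384 slots.
def crc16(data: bytes) -> int:
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    # Only the text inside the first non-empty {...} is hashed.
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384

# Keys sharing a hash tag map to the same slot, enabling multi-key ops.
assert key_slot("{user:123}:session") == key_slot("{user:123}:prefs")
```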
Memory Budget per Database
Current Allocation (256MB Total):
| DB | Purpose | Max Memory | Eviction Policy | Notes |
|---|---|---|---|---|
| 0 | Entity Cache | 100MB | volatile-lru | Short TTL (5min) |
| 1 | Sessions | 50MB | volatile-ttl | Session TTL (8hr) |
| 2 | Rate Limits | 20MB | volatile-ttl | Sliding windows |
| 3 | Distributed Locks | 10MB | noeviction | Must not evict |
| 4 | Background Jobs | 30MB | volatile-lru | Task queues |
| 5 | Reserved | 20MB | - | Future use |
| 6 | Reserved | 20MB | - | Future use |
| 7 | Test Only | 6MB | allkeys-lru | CI/CD tests |
Monitoring Memory Usage:
```shell
# Instance-wide memory usage (INFO does not break memory down per database)
redis-cli INFO memory | grep used_memory_human

# Key count per database
for db in {0..7}; do
  echo "DB $db: $(redis-cli -n $db DBSIZE)"
done

# Memory by key pattern
redis-cli --scan --pattern 'entity:*' | xargs -I{} redis-cli MEMORY USAGE {}
```
Configuration:
```
# redis.conf
maxmemory 256mb
maxmemory-policy volatile-lru
maxmemory-samples 10

# Alert thresholds (Prometheus)
# - Warning: 80% (204MB)
# - Critical: 90% (230MB)
```
Memory Pressure Response:
- 80% Usage: Increase TTL aggressiveness, review cache hit rates
- 90% Usage: Scale to larger instance or add Sentinel replica
- 95% Usage: Emergency: flush non-critical caches (DB 0, 4)
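The thresholds above translate directly into a simple classifier that monitoring code or an on-call runbook script could use; the function name and tier labels are illustrative, not part of the platform's API.

```python
# Illustrative classifier for the thresholds above (256MB budget):
# <80% ok, 80-90% warning, 90-95% critical, >=95% emergency.
def pressure_tier(used_mb: float, limit_mb: float = 256) -> str:
    ratio = used_mb / limit_mb
    if ratio >= 0.95:
        return "emergency"  # flush non-critical caches
    if ratio >= 0.90:
        return "critical"   # scale the instance or add a Sentinel replica
    if ratio >= 0.80:
        return "warning"    # tighten TTLs, review cache hit rates
    return "ok"
```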
References
- Redis Documentation
- Redis Sentinel
- Redis Cluster
- Redis Data Persistence
- Industry patterns: Session store, rate limiting
ADR-011 | Caching Layer | Implemented | APPROVED_WITH_MODS Completed