
ADR-011: Redis Single Instance Strategy

Status

Implemented

Date

2025-01-16 (Retrospective)

Decision Makers

  • Architecture Team - Caching architecture
  • SRE Team - Operations requirements

Layer

Caching

Related ADRs

  • ADR-012: Token Bucket Rate Limiting
  • ADR-013: Graceful Redis Degradation
  • ADR-014: Entity Counts Batching
  • ADR-033: Session Management

Supersedes

None

Depends On

  • ADR-010: Docker Compose Development

Context

The SRE Operations Platform needs caching for:

  1. Performance: Reduce database load for frequent queries
  2. Rate Limiting: Protect API from abuse
  3. Session Storage: User session management
  4. Distributed Locking: Coordinate multi-instance operations
  5. Background Tasks: Queue and progress tracking

Key constraints:

  • Must support all caching use cases in single deployment
  • Need graceful degradation when unavailable
  • Require logical separation of concerns
  • Must work in development and production
  • Keep operational complexity low

Decision

We adopt a single Redis instance with database separation strategy:

Key Design Decisions

  1. Single Instance: One Redis server handles all use cases
  2. Database Separation: Different Redis databases (0-7) for isolation
  3. Graceful Degradation: System works without Redis
  4. AOF Persistence: Append-only file for durability
  5. Connection Pooling: Efficient connection management
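
Decision 3 (graceful degradation) can be sketched as a thin wrapper that returns a fallback value when the cache is unreachable, so callers fall through to the database instead of failing. This is an illustrative sketch, not the project's actual cache manager; the built-in `ConnectionError`/`TimeoutError` stand in for `redis.exceptions` equivalents so the example runs without a server:

```python
def with_cache_fallback(op, fallback, errors=(ConnectionError, TimeoutError)):
    """Run a cache operation; on connectivity errors return a fallback.

    In the real system `errors` would include redis.exceptions.ConnectionError;
    built-in exception types are used here so the sketch is self-contained.
    """
    try:
        return op()
    except errors:
        return fallback

# A stub standing in for a Redis GET against an unreachable server:
def unreachable_get():
    raise ConnectionError("redis unavailable")

value = with_cache_fallback(unreachable_get, fallback=None)
assert value is None  # degraded path: caller falls through to the database
```

The same pattern wraps writes as well: a failed `SET` is swallowed and the system simply serves uncached until Redis returns.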

Database Allocation

| DB | Purpose           | TTL Strategy    |
|----|-------------------|-----------------|
| 0  | Entity Cache      | 5 min default   |
| 1  | Query Cache       | 2-5 min         |
| 2  | Rate Limiting     | Sliding window  |
| 3  | Sessions          | 8-24 hours      |
| 4  | Distributed Locks | 30 sec - 5 min  |
| 5  | Background Tasks  | Task duration   |
| 6  | AI Cache          | 1 hour          |
| 7  | Reserved          | Future use      |

Configuration

# Redis configuration
REDIS_URL = "redis://localhost:6379/0"
REDIS_PASSWORD = None # Optional
REDIS_SOCKET_TIMEOUT = 5
REDIS_CONNECTION_POOL_SIZE = 10
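
These settings could be loaded into a small typed config object; below is a standard-library-only sketch (the class and field names are illustrative, not the project's actual settings module):

```python
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass
class RedisSettings:
    host: str
    port: int
    db: int
    socket_timeout: int = 5   # mirrors REDIS_SOCKET_TIMEOUT
    pool_size: int = 10       # mirrors REDIS_CONNECTION_POOL_SIZE

def parse_redis_url(url: str) -> RedisSettings:
    """Split a redis:// URL into host, port, and database index."""
    parsed = urlparse(url)
    db = int(parsed.path.lstrip("/") or 0)
    return RedisSettings(host=parsed.hostname, port=parsed.port or 6379, db=db)

cfg = parse_redis_url("redis://localhost:6379/0")
# cfg.host == "localhost", cfg.port == 6379, cfg.db == 0
```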

Docker Compose Service

redis:
  image: redis:7-alpine
  command: redis-server --appendonly yes
  ports: ["6379:6379"]
  volumes: [redis_data:/data]
  healthcheck:
    test: ["CMD", "redis-cli", "ping"]

Consequences

Positive

  • Simplicity: One service to manage
  • Cost Effective: No need for Redis Cluster
  • Logical Isolation: Database separation provides clear boundaries
  • Persistence: AOF ensures data survives restarts
  • Low Latency: In-memory with sub-millisecond access

Negative

  • Single Point of Failure: Redis down affects all caching (mitigated by graceful degradation)
  • Memory Constraints: All data in single instance memory
  • No True Isolation: Databases share resources
  • Scaling Limit: Eventually may need cluster

Neutral

  • Monitoring: Single instance is easier to monitor
  • Backup: Simple backup/restore process

Alternatives Considered

1. Redis Cluster

  • Approach: Sharded Redis with automatic failover
  • Rejected: Over-engineered for current scale, operational complexity

2. Multiple Redis Instances

  • Approach: Separate instances per use case
  • Rejected: More resources, more complexity

3. Memcached + Redis

  • Approach: Memcached for cache, Redis for features
  • Rejected: Two systems to manage, feature overlap

Implementation Status

  • Core implementation complete
  • Tests written and passing
  • Documentation updated
  • Migration/upgrade path defined
  • Monitoring/observability in place

Implementation Details

  • Cache Manager: backend/core/caching/cache_manager.py
  • Redis Setup: backend/core/redis_manager.py
  • Docker Config: docker-compose.yml
  • Docs: backend/docs/architecture/redis-implementation-complete.md

Compliance/Validation

  • Automated checks: Health check endpoint for Redis
  • Manual review: Memory usage reviewed monthly
  • Metrics: Hit rate, miss rate, memory usage via Prometheus

LLM Council Review

Review Date: 2025-01-16
Confidence Level: High (100%)
Verdict: APPROVED WITH MODIFICATIONS

Quality Metrics

  • Consensus Strength Score (CSS): 1.0
  • Deliberation Depth Index (DDI): 0.92

Council Feedback Summary

The council identified a critical architectural flaw: the multi-database strategy (DB 0-7) is incompatible with future scaling to Redis Cluster, which only supports DB 0.

Key Concerns Identified:

  1. Cluster Incompatibility: Redis Cluster doesn't support SELECT (DB 1-7)
  2. No Resource Isolation: All DBs share CPU, memory, and eviction policy
  3. Connection Pool Too Low: 10 connections may bottleneck with 100 concurrent users

Required Modifications:

  1. Replace DB Indexes with Key Namespacing:
    entity:{type}:{id}
    cache:query:{hash}
    ratelimit:{id}
    session:{id}
    lock:{resource}
    ai:{hash}
  2. Increase Connection Pool: From 10 to 50 connections
  3. Plan HA Before Cluster: Redis Sentinel for high availability first
  4. Add Flush-Safety Classification: Document which key prefixes are safe to flush
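
The prefix scheme in modification 1 can be centralized in small key-builder helpers so every caller namespaces keys consistently; the function names below are illustrative, not the project's actual API:

```python
def entity_key(entity_type: str, entity_id: str) -> str:
    """entity:{type}:{id} — cluster-safe replacement for DB 0."""
    return f"entity:{entity_type}:{entity_id}"

def session_key(session_id: str) -> str:
    """session:{id} — replaces the sessions database."""
    return f"session:{session_id}"

def lock_key(resource: str) -> str:
    """lock:{resource} — replaces the distributed-locks database."""
    return f"lock:{resource}"

assert entity_key("host", "42") == "entity:host:42"
assert session_key("abc") == "session:abc"
```

Centralizing key construction also makes the flush-safety classification (modification 4) enforceable: each prefix is defined in exactly one place.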

Modifications Applied

  1. Updated key naming conventions to use prefixes
  2. Documented scaling path: Single → Sentinel (HA) → Cluster (sharding)
  3. Added memory limits and maxmemory-policy configuration
  4. Documented flush-safety classification per key prefix

Council Ranking

  • gpt-5.2: 0.889 (Best Response - identified Cluster incompatibility)
  • gemini-3-pro: 0.5
  • claude-opus-4.5: 0.5
  • grok-4.1: 0.0

Operational Guidelines (APPROVED_WITH_MODS)

Redis Sentinel Upgrade Path

Current State (Single Instance):

┌─────────────────────┐
│ Redis (Primary)     │
│ localhost:6379      │
│ DB 0: cache         │
│ DB 1: sessions      │
│ DB 2: rate_limit    │
└─────────────────────┘

Phase 1: Redis Sentinel (High Availability):

┌─────────────────────┐     ┌─────────────────────┐
│ Redis Primary       │────▶│ Redis Replica       │
│ redis-1:6379        │     │ redis-2:6379        │
└─────────────────────┘     └─────────────────────┘
          │                           │
          ▼                           ▼
┌─────────────────────────────────────────────────┐
│ Sentinel Cluster (3 nodes)                      │
│ sentinel-1:26379 sentinel-2:26379 sentinel-3    │
└─────────────────────────────────────────────────┘

Sentinel Configuration:

# sentinel.conf
sentinel monitor ops-master redis-1 6379 2
sentinel down-after-milliseconds ops-master 5000
sentinel failover-timeout ops-master 60000
sentinel parallel-syncs ops-master 1

Application Configuration for Sentinel:

# backend/core/redis/config.py
import redis
from redis.sentinel import Sentinel

def get_redis_client():
    """Return a Redis client, routed through Sentinel when enabled."""
    if settings.redis_sentinel_enabled:  # settings: application config object
        sentinel = Sentinel(
            [('sentinel-1', 26379), ('sentinel-2', 26379), ('sentinel-3', 26379)],
            socket_timeout=0.5,
        )
        return sentinel.master_for('ops-master', socket_timeout=0.5)
    return redis.Redis.from_url(settings.redis_url)

Phase 2: Redis Cluster (Sharding) - Future:

  • Requires key namespacing migration (no DB selection)
  • Use hash tags for multi-key operations: {user:123}:session, {user:123}:prefs
  • Minimum 6 nodes (3 masters, 3 replicas)
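
Hash tags work because Cluster hashes only the substring inside `{...}` when computing a key's slot, so keys sharing a tag land on the same node. A sketch of building such keys (the helper name is illustrative):

```python
def user_key(user_id: int, suffix: str) -> str:
    """Keys sharing the {user:<id>} hash tag map to the same Cluster slot,
    so multi-key commands (MGET, transactions) across them remain valid."""
    return f"{{user:{user_id}}}:{suffix}"

session = user_key(123, "session")
prefs = user_key(123, "prefs")
# Both keys hash on "user:123", so they always colocate under Cluster.
assert session == "{user:123}:session"
```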

Memory Budget per Database

Current Allocation (256MB Total):

| DB | Purpose           | Max Memory | Eviction Policy | Notes             |
|----|-------------------|------------|-----------------|-------------------|
| 0  | Entity Cache      | 100MB      | volatile-lru    | Short TTL (5min)  |
| 1  | Sessions          | 50MB       | volatile-ttl    | Session TTL (8hr) |
| 2  | Rate Limits       | 20MB       | volatile-ttl    | Sliding windows   |
| 3  | Distributed Locks | 10MB       | noeviction      | Must not evict    |
| 4  | Background Jobs   | 30MB       | volatile-lru    | Task queues       |
| 5  | Reserved          | 20MB      | -                | Future use        |
| 6  | Reserved          | 20MB      | -                | Future use        |
| 7  | Test Only         | 6MB       | allkeys-lru      | CI/CD tests       |

Monitoring Memory Usage:

# Overall instance memory (INFO reports totals, not per-database breakdowns)
redis-cli INFO memory | grep used_memory_human

# Key count per database
for db in {0..7}; do
  echo "DB $db: $(redis-cli -n $db DBSIZE)"
done

# Memory by key pattern
redis-cli --scan --pattern 'entity:*' | xargs -I{} redis-cli MEMORY USAGE {}

Configuration:

# redis.conf
maxmemory 256mb
maxmemory-policy volatile-lru
maxmemory-samples 10

# Alert thresholds (Prometheus)
# - Warning: 80% (204MB)
# - Critical: 90% (230MB)

Memory Pressure Response:

  1. 80% Usage: Increase TTL aggressiveness, review cache hit rates
  2. 90% Usage: Scale to larger instance or add Sentinel replica
  3. 95% Usage: Emergency: flush non-critical caches (DB 0, 4)
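
The emergency flush in step 3 should touch only prefixes classified as flush-safe, using SCAN-style iteration rather than a blocking FLUSHDB. The sketch below runs against an in-memory stand-in for the client (the `FakeRedis` stub exists only so the example is self-contained; with redis-py you would pass a real client, which exposes the same `scan_iter`/`delete` methods):

```python
import fnmatch

class FakeRedis:
    """Minimal stand-in implementing the two methods the flush uses."""
    def __init__(self, keys):
        self.store = dict.fromkeys(keys, b"x")
    def scan_iter(self, match):
        return [k for k in list(self.store) if fnmatch.fnmatch(k, match)]
    def delete(self, *keys):
        for k in keys:
            self.store.pop(k, None)
        return len(keys)

# Flush-safe prefixes only: never lock:* (noeviction) or session:* (user impact)
FLUSH_SAFE_PREFIXES = ["entity:*", "cache:query:*"]

def emergency_flush(client, patterns=FLUSH_SAFE_PREFIXES):
    """Delete only keys under flush-safe prefixes, via incremental SCAN."""
    removed = 0
    for pattern in patterns:
        for key in client.scan_iter(match=pattern):
            removed += client.delete(key)
    return removed

r = FakeRedis(["entity:host:1", "cache:query:abc", "lock:db", "session:s1"])
assert emergency_flush(r) == 2
assert set(r.store) == {"lock:db", "session:s1"}  # critical keys untouched
```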
