ADR-013: Graceful Redis Degradation
Status
Implemented
Date
2025-01-16 (Retrospective)
Decision Makers
- SRE Team - Reliability requirements
- Architecture Team - Failure handling
Layer
Caching
Related ADRs
- ADR-011: Redis Single Instance Strategy
- ADR-012: Token Bucket Rate Limiting
Supersedes
None
Depends On
- ADR-011: Redis Single Instance Strategy
Context
Redis is a dependency for caching, rate limiting, and sessions, but the application cannot assume it is always reachable:
- Availability: Redis may be temporarily unavailable
- Development: Developers may not run Redis locally
- Testing: Unit tests should run without Redis
- Resilience: Application should not fail without cache
- Performance: Degraded mode should still be usable
Key constraints:
- Application must start without Redis
- Core functionality must work without cache
- Clear logging when degraded
- Automatic recovery when Redis returns
- Minimal performance impact from checks
Decision
We implement graceful degradation at every Redis touchpoint.
Key Design Decisions
- Try/Except Wrapper: All Redis operations wrapped
- Fallback Behavior: Defined for each operation type
- Health Monitoring: Periodic Redis connectivity checks
- Automatic Recovery: Reconnect when Redis available
- Metrics Tracking: Count degraded operations
Degradation Behaviors
| Operation | Normal | Degraded |
|---|---|---|
| Cache Get | Return cached value | Return None (cache miss) |
| Cache Set | Store in Redis | No-op (skip caching) |
| Rate Limit | Check token bucket | Allow request (bypass) |
| Session Get | Return session | Return None (force re-auth) |
| Lock Acquire | Redis lock | Local threading lock |
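The table above maps each operation to a fixed fallback value. One way to express that mapping is a small decorator, sketched below. The names `degrade_to`, `CacheBackendError`, and `FailingBackend` are hypothetical and exist only for this illustration; the real code in this ADR catches the Redis client's own error type.

```python
import asyncio
import functools
import logging
from typing import Any, Awaitable, Callable

logger = logging.getLogger("degradation_sketch")


class CacheBackendError(Exception):
    """Stand-in for the Redis client's error type (hypothetical name)."""


def degrade_to(fallback: Any) -> Callable:
    """Return `fallback` instead of raising when the backend call fails."""
    def decorator(func: Callable[..., Awaitable[Any]]) -> Callable[..., Awaitable[Any]]:
        @functools.wraps(func)
        async def wrapper(*args: Any, **kwargs: Any) -> Any:
            try:
                return await func(*args, **kwargs)
            except CacheBackendError as exc:
                logger.warning("%s degraded: %s", func.__name__, exc)
                return fallback
        return wrapper
    return decorator


class FailingBackend:
    """Simulates an unreachable Redis: every call raises."""

    async def get(self, key: str) -> Any:
        raise CacheBackendError("connection refused")

    async def consume_token(self, client_id: str) -> bool:
        raise CacheBackendError("connection refused")


backend = FailingBackend()


@degrade_to(None)  # Cache Get: degraded result is a cache miss
async def cache_get(key: str) -> Any:
    return await backend.get(key)


@degrade_to(True)  # Rate Limit: degraded result is "allow" (bypass)
async def rate_limit_allows(client_id: str) -> bool:
    return await backend.consume_token(client_id)
```

With the failing backend, `asyncio.run(cache_get("user:1"))` resolves to `None` and `asyncio.run(rate_limit_allows("client-a"))` resolves to `True`, matching the Degraded column.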
Implementation Pattern
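import logging
from typing import Any

from redis.exceptions import RedisError  # assumes the redis-py client

logger = logging.getLogger(__name__)
class ResilientCache:
    def __init__(self, redis: Any) -> None:
        self.redis = redis  # async Redis client
        self.is_connected = True

    async def get(self, key: str) -> Any | None:
        # While disconnected, skip Redis and report a cache miss.
        if not self.is_connected:
            return None
        try:
            return await self.redis.get(key)
        except RedisError as e:
            logger.warning(f"Redis get failed: {e}")
            self._mark_disconnected()
            return None

    async def _health_check(self):
        """Periodic check to restore connection."""
        try:
            await self.redis.ping()
            self._mark_connected()
        except RedisError:
            pass  # still unavailable; stay degraded

    def _mark_connected(self) -> None:
        self.is_connected = True
    def _mark_disconnected(self) -> None:
        self.is_connected = False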
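The pattern above does not show how `_health_check` gets scheduled. A minimal sketch of one option follows, assuming an asyncio background task. The function name `run_health_checks`, the interval, and the stop event are illustrative choices and are not taken from the codebase.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_health_checks(
    check: Callable[[], Awaitable[None]],
    interval_seconds: float,
    stop: asyncio.Event,
) -> int:
    """Call `check` every `interval_seconds` until `stop` is set; return the run count."""
    runs = 0
    while not stop.is_set():
        await check()
        runs += 1
        try:
            # Sleep for one interval, but wake immediately on shutdown.
            await asyncio.wait_for(stop.wait(), timeout=interval_seconds)
        except asyncio.TimeoutError:
            pass  # interval elapsed without a stop request
    return runs


async def demo() -> int:
    """Run the loop against a fake check that requests shutdown on its third call."""
    stop = asyncio.Event()
    calls: List[int] = []

    async def fake_check() -> None:
        calls.append(1)
        if len(calls) == 3:
            stop.set()

    return await run_health_checks(fake_check, 0.01, stop)
```

In an application, `check` would be the cache's `_health_check` coroutine, and the loop would be started with `asyncio.create_task(...)` at startup and stopped by setting the event at shutdown.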
Logging
# On disconnect
logger.warning("Redis unavailable, operating in degraded mode")
# On reconnect
logger.info("Redis connection restored, resuming normal operation")
# Per-operation (debug level)
logger.debug(f"Cache miss (degraded): {key}")
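During a long outage every failed operation could repeat the disconnect warning. The sketch below logs only when the state actually changes. `ConnectionState` and its `transitions` counter are hypothetical names for this example; the log messages are the ones listed above.

```python
import logging

logger = logging.getLogger("redis_state_sketch")


class ConnectionState:
    """Tracks connectivity and logs only on an actual state change."""

    def __init__(self) -> None:
        self.is_connected = True
        self.transitions = 0  # number of state changes seen so far

    def mark_disconnected(self) -> None:
        if self.is_connected:
            self.is_connected = False
            self.transitions += 1
            logger.warning("Redis unavailable, operating in degraded mode")

    def mark_connected(self) -> None:
        if not self.is_connected:
            self.is_connected = True
            self.transitions += 1
            logger.info("Redis connection restored, resuming normal operation")
```

Calling `mark_disconnected()` twice in a row emits one warning, not two, so the per-operation noise stays at debug level.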
Consequences
Positive
- High Availability: Application works without Redis
- Development Friendly: No mandatory Redis setup
- Test Simplicity: Unit tests don't need Redis
- Automatic Recovery: No manual intervention needed
- Operational Visibility: Clear logs for degraded state
Negative
- Performance Impact: Degraded mode slower (no cache)
- Rate Limiting Bypass: Abuse protection weakened
- Session Loss: Users may need to re-authenticate
- Complexity: Every operation needs fallback logic
Neutral
- Code Overhead: Additional try/except blocks
- Monitoring Need: Must track degradation frequency
Alternatives Considered
1. Hard Dependency
- Approach: Require Redis, fail startup without it
- Rejected: Reduces availability, harder development
2. In-Memory Fallback
- Approach: Use local memory cache when Redis down
- Rejected: Inconsistent across instances, memory pressure
3. Secondary Redis
- Approach: Failover to backup Redis
- Rejected: Operational complexity, cost
Implementation Status
- Core implementation complete
- Tests written and passing
- Documentation updated
- Migration/upgrade path defined
- Monitoring/observability in place
Implementation Details
- Resilient Cache: backend/core/caching/cache_manager.py
- Connection Health: backend/core/redis_manager.py
- Rate Limit Bypass: backend/core/rate_limiting/
- Session Fallback: backend/core/auth/session.py
Compliance/Validation
- Automated checks: Tests run with and without Redis
- Manual review: Degraded mode tested in staging
- Metrics: Degraded operation count via Prometheus
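As a rough illustration of the degraded-operation count, the sketch below keeps request-path fallbacks and background ping failures under separate keys. It uses a stdlib `Counter` as a stand-in so the example is self-contained; with Prometheus the same split would most likely be a labelled counter. The function names and metric keys are assumptions made for this example.

```python
from collections import Counter

# Hypothetical metric keys: (kind, operation).
degradation_events: Counter = Counter()


def record_degraded_operation(operation: str) -> None:
    """Count a request-path operation that took its fallback branch."""
    degradation_events[("degraded_operation", operation)] += 1


def record_health_check_failure() -> None:
    """Count a failed background ping; no user request is involved."""
    degradation_events[("health_check_failure", "ping")] += 1


# Example: two degraded cache reads, one bypassed rate limit, one failed ping.
record_degraded_operation("cache_get")
record_degraded_operation("cache_get")
record_degraded_operation("rate_limit")
record_health_check_failure()
```

Keeping the two kinds apart lets an alert on user-facing degradation ignore ping failures that never affected a request.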
LLM Council Review
Review Date: 2025-01-16
Confidence Level: High (100%)
Verdict: CONDITIONAL APPROVAL
Quality Metrics
- Consensus Strength Score (CSS): 0.95
- Deliberation Depth Index (DDI): 0.90
Council Feedback Summary
Graceful degradation is the correct strategy for an SRE platform (availability is paramount), but the specific fallback behaviors for Rate Limiting and Sessions are critical vulnerabilities.
Key Concerns Identified:
- Rate Limit → Allow is a Critical Security Flaw: A Redis outage disables all throttling, enabling credential stuffing and DoS
- Session → Re-auth Causes "Login Loop of Death": If Redis stores sessions and is down, users are logged out, try to log in again (which fails, since the session cannot be written), and are logged out again. This locks SREs out during outages
- Simple try/except is insufficient: Need Circuit Breaker pattern to prevent connection timeout hangs
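To make the circuit breaker concern concrete, here is a minimal sketch, assuming a consecutive-failure threshold and a fixed reset timeout. The class name, parameters, and defaults are illustrative and not taken from the codebase. While the circuit is open, callers skip Redis entirely, so they never wait on a connection timeout.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures; probes again after `reset_timeout`."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        """True when the caller should try Redis; False means use the fallback now."""
        if self._opened_at is None:
            return True
        # Once the timeout has elapsed, calls are let through again as probes;
        # a failed probe re-opens the circuit via record_failure().
        return self._clock() - self._opened_at >= self.reset_timeout

    def record_success(self) -> None:
        """Close the circuit and reset the failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Count a failure and open (or re-open) the circuit at the threshold."""
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A caller would check `allow_request()` before each Redis call, then report the outcome with `record_success()` or `record_failure()`. The injectable `clock` is there only to make the behavior testable without sleeping.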
Required Modifications:
- Tiered Rate Limit Fallback:
  - Sensitive Actions (Login, Admin): Fail Closed or strict local limits
  - General Traffic: Local in-memory token bucket
- Decouple Session from Redis:
  - Preferred: Stateless JWTs (verify signature locally, Redis only for revocation)
  - Alternative: Database fallback for session writes
- Adopt Circuit Breaker Pattern: Immediate switch to local limits upon Redis failure
- Refine Metrics: Distinguish "Health Check Failures" from "Degraded Operations"
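The tiered rate limit fallback could look roughly like the sketch below. It takes the "Fail Closed" option for sensitive actions and a per-process token bucket for general traffic. The action names, class names, and limits are assumptions for illustration only, and the local bucket is not shared across instances.

```python
import time
from typing import Callable, Dict

SENSITIVE_ACTIONS = {"login", "admin"}  # assumed action names


class LocalTokenBucket:
    """Per-process token bucket; its state is not shared across instances."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = capacity
        self._updated_at = clock()

    def try_consume(self) -> bool:
        """Refill based on elapsed time, then take one token if available."""
        now = self._clock()
        elapsed = now - self._updated_at
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)
        self._updated_at = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


class DegradedRateLimiter:
    """Intended for use only while Redis is unavailable."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._capacity = capacity
        self._refill = refill_per_second
        self._clock = clock
        self._buckets: Dict[str, LocalTokenBucket] = {}

    def allow(self, client_id: str, action: str) -> bool:
        if action in SENSITIVE_ACTIONS:
            return False  # fail closed: no login/admin traffic without Redis
        if client_id not in self._buckets:
            self._buckets[client_id] = LocalTokenBucket(
                self._capacity, self._refill, self._clock)
        return self._buckets[client_id].try_consume()
```

Failing closed on login trades availability for safety. If that is too strict for an outage scenario, the same structure supports the other option named above: give sensitive actions their own, much smaller bucket.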
Modifications Applied
- Documented Circuit Breaker requirement
- Added in-memory rate limiting fallback
- Documented stateless JWT session strategy
- Updated metrics to track degradation types
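For the stateless session strategy, local verification of an HS256-signed JWT needs only the shared secret, so it keeps working while Redis is down. The sketch below uses the standard library only; a production service would more likely use a maintained JWT library. The function names and the choice of HS256 are assumptions, and the Redis-backed revocation check is deliberately left out.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Any, Dict, Optional


def _b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    padding = "=" * (-len(text) % 4)  # restore the stripped base64 padding
    return base64.urlsafe_b64decode(text + padding)


def sign_token(claims: Dict[str, Any], secret: bytes) -> str:
    """Build a compact HS256 token: header.payload.signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url_encode(signature)}"


def verify_token(token: str, secret: bytes,
                 now: Optional[float] = None) -> Optional[Dict[str, Any]]:
    """Return the claims if the signature and expiry check out, else None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        expected = hmac.new(
            secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
        ).digest()
        # Constant-time comparison; parse the body only after it passes.
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            return None
        header = json.loads(_b64url_decode(header_b64))
        if header.get("alg") != "HS256":
            return None
        claims = json.loads(_b64url_decode(payload_b64))
    except ValueError:
        return None  # wrong segment count, bad base64, or bad JSON
    current = time.time() if now is None else now
    if claims.get("exp", 0) <= current:
        return None  # expired, or no expiry claim at all
    return claims
```

While Redis is unavailable, a revoked but unexpired token would still verify, so short expiry times limit that window.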
Council Ranking
- All models reached consensus on security concerns
- claude-opus-4.5: Best Response (session strategy)
- gemini-3-pro: Strong (circuit breaker emphasis)
References
ADR-013 | Caching Layer | Implemented