ADR-013: Graceful Redis Degradation

Status

Implemented

Date

2025-01-16 (Retrospective)

Decision Makers

  • SRE Team - Reliability requirements
  • Architecture Team - Failure handling

Layer

Caching

Related ADRs

  • ADR-011: Redis Single Instance Strategy
  • ADR-012: Token Bucket Rate Limiting

Supersedes

None

Depends On

  • ADR-011: Redis Single Instance Strategy

Context

Redis is a dependency for caching, rate limiting, and sessions, but the application cannot assume it is always present:

  1. Availability: Redis may be temporarily unavailable
  2. Development: Developers may not run Redis locally
  3. Testing: Unit tests should run without Redis
  4. Resilience: Application should not fail without cache
  5. Performance: Degraded mode should still be usable

Key constraints:

  • Application must start without Redis
  • Core functionality must work without cache
  • Clear logging when degraded
  • Automatic recovery when Redis returns
  • Minimal performance impact from checks

Decision

We implement graceful degradation at every Redis touchpoint:

Key Design Decisions

  1. Try/Except Wrapper: All Redis operations wrapped
  2. Fallback Behavior: Defined for each operation type
  3. Health Monitoring: Periodic Redis connectivity checks
  4. Automatic Recovery: Reconnect when Redis available
  5. Metrics Tracking: Count degraded operations

Degradation Behaviors

| Operation    | Normal              | Degraded                      |
|--------------|---------------------|-------------------------------|
| Cache Get    | Return cached value | Return None (cache miss)      |
| Cache Set    | Store in Redis      | No-op (skip caching)          |
| Rate Limit   | Check token bucket  | Allow request (bypass)        |
| Session Get  | Return session      | Return None (force re-auth)   |
| Lock Acquire | Redis lock          | Local threading lock          |
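
The per-operation fallbacks above could be expressed as a small decorator that returns a fixed value whenever the underlying Redis call fails. This is only a sketch under stated assumptions: `degrade_to`, `cache_get`, `rate_limit_allows`, and `CacheUnavailable` are hypothetical names, and `CacheUnavailable` stands in for the real Redis error type so the example runs without Redis.

```python
import asyncio
import functools
import logging
from typing import Any, Awaitable, Callable, Optional

logger = logging.getLogger(__name__)


class CacheUnavailable(Exception):
    """Hypothetical stand-in for the Redis client's error type."""


def degrade_to(fallback: Any) -> Callable:
    """Return `fallback` instead of raising when the wrapped call fails."""
    def decorator(func: Callable[..., Awaitable[Any]]) -> Callable[..., Awaitable[Any]]:
        @functools.wraps(func)
        async def wrapper(*args: Any, **kwargs: Any) -> Any:
            try:
                return await func(*args, **kwargs)
            except CacheUnavailable as exc:
                logger.warning("%s degraded: %s", func.__name__, exc)
                return fallback
        return wrapper
    return decorator


@degrade_to(None)  # Cache Get: treated as a cache miss
async def cache_get(key: str) -> Optional[Any]:
    raise CacheUnavailable("connection refused")  # simulate an outage


@degrade_to(True)  # Rate Limit: allow the request (bypass)
async def rate_limit_allows(client_id: str) -> bool:
    raise CacheUnavailable("connection refused")  # simulate an outage
```

Keeping the fallback value next to each operation makes the table auditable in code. Note that the Council review later in this ADR asks for a stricter rate-limit fallback than the bypass shown here.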

Implementation Pattern

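import logging
from typing import Any

from redis.exceptions import RedisError

logger = logging.getLogger(__name__)


class ResilientCache:
    async def get(self, key: str) -> Any | None:
        """Return the cached value, or None when Redis is unavailable."""
        try:
            if not self.is_connected:
                return None  # degraded: behave as a cache miss
            return await self.redis.get(key)
        except RedisError as e:
            logger.warning(f"Redis get failed: {e}")
            self._mark_disconnected()
            return None

    async def _health_check(self):
        """Periodic check to restore connection."""
        try:
            await self.redis.ping()
            self._mark_connected()
        except RedisError:
            pass  # still down; stay degraded until the next check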
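
The pattern above does not show how the health check is scheduled. One way it might be driven is a background asyncio task that pings at a fixed interval. The sketch below assumes a 5 second default interval and uses hypothetical `FakeRedis` and `HealthMonitor` classes so it runs without a Redis server; none of these names come from the codebase.

```python
import asyncio


class FakeRedis:
    """Hypothetical stand-in client: ping() fails until `healthy` is set."""

    def __init__(self) -> None:
        self.healthy = False

    async def ping(self) -> bool:
        if not self.healthy:
            raise ConnectionError("redis down")
        return True


class HealthMonitor:
    def __init__(self, client: FakeRedis, interval: float = 5.0) -> None:
        self.client = client
        self.interval = interval  # assumed value, not taken from the ADR
        self.is_connected = False

    async def check_once(self) -> bool:
        """Ping once and record whether Redis answered."""
        try:
            await self.client.ping()
            self.is_connected = True
        except ConnectionError:
            self.is_connected = False
        return self.is_connected

    async def run(self) -> None:
        """Ping forever at a fixed interval; cancel the task to stop."""
        while True:
            await self.check_once()
            await asyncio.sleep(self.interval)
```

In an application this would typically be started once at startup with `asyncio.create_task(monitor.run())` and cancelled at shutdown.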

Logging

# On disconnect
logger.warning("Redis unavailable, operating in degraded mode")

# On reconnect
logger.info("Redis connection restored, resuming normal operation")

# Per-operation (debug level)
logger.debug(f"Cache miss (degraded): {key}")
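
To keep the disconnect and reconnect messages from repeating on every failed call, they could be emitted only on state transitions. The sketch below is one possible shape; `ConnectionState` and its method names are hypothetical, and the boolean return values exist only to make the transitions easy to observe.

```python
import logging

logger = logging.getLogger("redis_state")


class ConnectionState:
    """Tracks connectivity and logs only when the state actually changes."""

    def __init__(self) -> None:
        self.is_connected = True

    def mark_disconnected(self) -> bool:
        """Return True if this call moved the state to degraded."""
        if not self.is_connected:
            return False  # already degraded: stay quiet
        self.is_connected = False
        logger.warning("Redis unavailable, operating in degraded mode")
        return True

    def mark_connected(self) -> bool:
        """Return True if this call restored normal operation."""
        if self.is_connected:
            return False  # already healthy: stay quiet
        self.is_connected = True
        logger.info("Redis connection restored, resuming normal operation")
        return True
```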

Consequences

Positive

  • High Availability: Application works without Redis
  • Development Friendly: No mandatory Redis setup
  • Test Simplicity: Unit tests don't need Redis
  • Automatic Recovery: No manual intervention needed
  • Operational Visibility: Clear logs for degraded state

Negative

  • Performance Impact: Degraded mode slower (no cache)
  • Rate Limiting Bypass: Abuse protection weakened
  • Session Loss: Users may need to re-authenticate
  • Complexity: Every operation needs fallback logic

Neutral

  • Code Overhead: Additional try/except blocks
  • Monitoring Need: Must track degradation frequency

Alternatives Considered

1. Hard Dependency

  • Approach: Require Redis, fail startup without it
  • Rejected: Reduces availability, harder development

2. In-Memory Fallback

  • Approach: Use local memory cache when Redis down
  • Rejected: Inconsistent across instances, memory pressure

3. Secondary Redis

  • Approach: Failover to backup Redis
  • Rejected: Operational complexity, cost

Implementation Status

  • Core implementation complete
  • Tests written and passing
  • Documentation updated
  • Migration/upgrade path defined
  • Monitoring/observability in place

Implementation Details

  • Resilient Cache: backend/core/caching/cache_manager.py
  • Connection Health: backend/core/redis_manager.py
  • Rate Limit Bypass: backend/core/rate_limiting/
  • Session Fallback: backend/core/auth/session.py

Compliance/Validation

  • Automated checks: Tests run with and without Redis
  • Manual review: Degraded mode tested in staging
  • Metrics: Degraded operation count via Prometheus

LLM Council Review

Review Date: 2025-01-16
Confidence Level: High (100%)
Verdict: CONDITIONAL APPROVAL

Quality Metrics

  • Consensus Strength Score (CSS): 0.95
  • Deliberation Depth Index (DDI): 0.90

Council Feedback Summary

Graceful degradation is the correct strategy for an SRE platform (availability is paramount), but the specific fallback behaviors for Rate Limiting and Sessions are critical vulnerabilities.

Key Concerns Identified:

  1. Rate Limit → Allow is Critical Security Flaw: Redis outage disables all throttling, enabling credential stuffing and DoS
  2. Session → Re-auth Causes "Login Loop of Death": If Redis stores sessions and is down, users are logged out, their login attempts fail because the new session cannot be stored, and they are logged out again. This locks SREs out during outages.
  3. Simple try/except is insufficient: Need Circuit Breaker pattern to prevent connection timeout hangs
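
For the third concern, a circuit breaker can be sketched as a small state holder that stops sending calls to Redis after repeated failures and allows a probe after a cool-down. This is a simplified illustration, not the project's implementation: the class name, thresholds, and the injectable `clock` parameter are assumptions, and the half-open state here does not limit the number of concurrent probes.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `max_failures` failures; allows a probe after `reset_after` seconds."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        """True if a Redis call should be attempted right now."""
        if self._opened_at is None:
            return True  # closed: normal operation
        # open: skip Redis until the cool-down has elapsed (then half-open)
        return self._clock() - self._opened_at >= self.reset_after

    def record_success(self) -> None:
        """Close the circuit and forget earlier failures."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Count a failure; open (or re-open) the circuit at the threshold."""
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock()
```

Callers would check `allow_request()` before touching Redis and go straight to the fallback when it returns False, which avoids waiting on a connection timeout for every request.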

Required Modifications:

  1. Tiered Rate Limit Fallback:
    • Sensitive Actions (Login, Admin): Fail Closed or strict local limits
    • General Traffic: Local in-memory token bucket
  2. Decouple Session from Redis:
    • Preferred: Stateless JWTs (verify signature locally, Redis only for revocation)
    • Alternative: Database fallback for session writes
  3. Adopt Circuit Breaker Pattern: Immediate switch to local limits upon Redis failure
  4. Refine Metrics: Distinguish "Health Check Failures" from "Degraded Operations"
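
The tiered rate limit fallback in item 1 might look like the following sketch, which takes the "strict local limits" option for sensitive actions and a larger local token bucket for general traffic. The tier values, action labels, and class names are illustrative assumptions. Limits are enforced per process, so the effective limit across several instances is higher than configured.

```python
import time
from typing import Callable, Dict, Tuple


class LocalTokenBucket:
    """In-process token bucket; state is not shared across instances."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def try_consume(self) -> bool:
        """Refill based on elapsed time, then take one token if available."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_sec)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


# Hypothetical tiers: (capacity, refill tokens per second).
TIERS: Dict[str, Tuple[float, float]] = {
    "sensitive": (5, 0.1),
    "general": (100, 10.0),
}
SENSITIVE_ACTIONS = {"login", "admin"}  # assumed action labels


class DegradedRateLimiter:
    """Intended for use only while Redis is unavailable."""

    def __init__(self, tiers: Dict[str, Tuple[float, float]] = TIERS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._tiers = tiers
        self._clock = clock
        self._buckets: Dict[Tuple[str, str], LocalTokenBucket] = {}

    def allow(self, client_id: str, action: str) -> bool:
        tier = "sensitive" if action in SENSITIVE_ACTIONS else "general"
        key = (tier, client_id)
        if key not in self._buckets:
            capacity, refill = self._tiers[tier]
            self._buckets[key] = LocalTokenBucket(capacity, refill, self._clock)
        return self._buckets[key].try_consume()
```

A production version would also need to evict idle buckets, since this sketch keeps one per client for the life of the process.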

Modifications Applied

  1. Documented Circuit Breaker requirement
  2. Added in-memory rate limiting fallback
  3. Documented stateless JWT session strategy
  4. Updated metrics to track degradation types
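
To illustrate why the stateless session strategy survives a Redis outage, the sketch below verifies a signed token using only a local secret. It is a simplified HMAC-signed token built from the standard library, not a real JWT and not the project's session code; the secret, function names, and claim layout are assumptions. In the strategy described above, Redis would only be consulted for revocation, which this sketch omits.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET = b"demo-secret-change-me"  # hypothetical; load from configuration in practice


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(payload: str) -> str:
    return _b64(hmac.new(SECRET, payload.encode(), hashlib.sha256).digest())


def issue_token(user_id: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Create a signed token carrying the user id and an expiry time."""
    issued = time.time() if now is None else now
    claims = {"sub": user_id, "exp": issued + ttl_seconds}
    payload = _b64(json.dumps(claims).encode())
    return f"{payload}.{_sign(payload)}"


def verify_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the user id if signature and expiry check out, else None."""
    try:
        payload, sig = token.split(".")
        if not hmac.compare_digest(sig.encode(), _sign(payload).encode()):
            return None
        claims = json.loads(_unb64(payload))
    except ValueError:
        return None  # malformed token
    current = time.time() if now is None else now
    return claims["sub"] if claims["exp"] > current else None
```

Because verification needs no network call, a Redis outage cannot by itself log users out, which is the failure the Council called the "Login Loop of Death".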

Council Ranking

  • All models reached consensus on security concerns
  • claude-opus-4.5: Best Response (session strategy)
  • gemini-3-pro: Strong (circuit breaker emphasis)


ADR-013 | Caching Layer | Implemented