ADR-013: Graceful Redis Degradation

Status

Implemented

Date

2025-01-16 (Retrospective)

Decision Makers

SRE Team - Reliability requirements
Architecture Team - Failure handling

Layer

Caching

Related ADRs

ADR-011: Redis Single Instance Strategy
ADR-012: Token Bucket Rate Limiting

Supersedes

None

Depends On

ADR-011: Redis Single Instance Strategy

Context

Redis is a dependency for caching, rate limiting, and sessions, but:

Availability: Redis may be temporarily unavailable
Development: Developers may not run Redis locally
Testing: Unit tests should run without Redis
Resilience: Application should not fail without cache
Performance: Degraded mode should still be usable

Key constraints:

Application must start without Redis
Core functionality must work without cache
Clear logging when degraded
Automatic recovery when Redis returns
Minimal performance impact from checks

Decision

We implement graceful degradation at every Redis touchpoint:

Key Design Decisions

Try/Except Wrapper: All Redis operations wrapped
Fallback Behavior: Defined for each operation type
Health Monitoring: Periodic Redis connectivity checks
Automatic Recovery: Reconnect when Redis available
Metrics Tracking: Count degraded operations

Degradation Behaviors

Operation     Normal                Degraded
Cache Get     Return cached value   Return None (cache miss)
Cache Set     Store in Redis        No-op (skip caching)
Rate Limit    Check token bucket    Allow request (bypass)
Session Get   Return session        Return None (force re-auth)
Lock Acquire  Redis lock            Local threading lock
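
The fallback values in the table can be declared once per operation rather than repeated in every call site. The following is a minimal sketch, not the project's implementation: the `degrade_to` decorator and the `BackendError` exception (a stand-in for `redis.exceptions.RedisError`) are hypothetical names introduced for illustration, and the two operations always fail to simulate an outage.

```python
import asyncio
import functools
from typing import Any, Awaitable, Callable


class BackendError(Exception):
    """Hypothetical stand-in for redis.exceptions.RedisError."""


def degrade_to(fallback: Any) -> Callable:
    """Return `fallback` instead of raising when the backend call fails."""
    def decorator(func: Callable[..., Awaitable[Any]]) -> Callable[..., Awaitable[Any]]:
        @functools.wraps(func)
        async def wrapper(*args: Any, **kwargs: Any) -> Any:
            try:
                return await func(*args, **kwargs)
            except BackendError:
                return fallback
        return wrapper
    return decorator


# Illustrative operations that always fail, as if Redis were down.
@degrade_to(None)  # Cache Get: degraded result is a cache miss
async def cache_get(key: str) -> Any:
    raise BackendError("connection refused")


@degrade_to(True)  # Rate Limit: degraded result is "allow" (bypass)
async def rate_limit_allows(client_id: str) -> bool:
    raise BackendError("connection refused")


async def demo() -> tuple[Any, bool]:
    return await cache_get("user:1"), await rate_limit_allows("10.0.0.1")


result = asyncio.run(demo())
```

Keeping the fallback next to the operation makes each row of the table visible in code, which also makes the risky rows (rate limit bypass, forced re-auth) easy to find in review.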

Implementation Pattern

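import logging
from typing import Any

from redis.asyncio import Redis
from redis.exceptions import RedisError

logger = logging.getLogger(__name__)

class ResilientCache:
    def __init__(self, redis: Redis) -> None:  # assumed; not shown in the ADR
        self.redis = redis
        self.is_connected = True

    async def get(self, key: str) -> Any | None:
        if not self.is_connected:
            return None  # degraded: report a cache miss without touching Redis
        try:
            return await self.redis.get(key)
        except RedisError as e:
            logger.warning("Redis get failed: %s", e)
            self._mark_disconnected()
            return None

    async def _health_check(self) -> None:
        """Periodic check to restore connection."""
        try:
            await self.redis.ping()
            self._mark_connected()
        except RedisError:
            pass  # still unavailable; stay degraded until the next check

    def _mark_disconnected(self) -> None:  # minimal assumed implementation
        self.is_connected = False

    def _mark_connected(self) -> None:  # minimal assumed implementation
        self.is_connected = True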
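
The pattern above does not show how the periodic check is scheduled. One possible approach is a background asyncio task, sketched below under stated assumptions: `FakeRedis` and `HealthMonitor` are hypothetical names, the built-in `ConnectionError` stands in for `RedisError`, and the 10 ms interval is chosen only so the demo finishes quickly.

```python
import asyncio


class FakeRedis:
    """Hypothetical stand-in client: ping fails until `up` is set to True."""

    def __init__(self) -> None:
        self.up = False

    async def ping(self) -> bool:
        if not self.up:
            raise ConnectionError("redis down")
        return True


class HealthMonitor:
    def __init__(self, client: FakeRedis, interval: float) -> None:
        self.client = client
        self.interval = interval
        self.is_connected = False

    async def run(self) -> None:
        """Ping until cancelled, updating is_connected after each attempt."""
        while True:
            try:
                await self.client.ping()
                self.is_connected = True
            except ConnectionError:
                self.is_connected = False
            await asyncio.sleep(self.interval)


async def demo() -> tuple[bool, bool]:
    client = FakeRedis()
    monitor = HealthMonitor(client, interval=0.01)
    task = asyncio.create_task(monitor.run())
    await asyncio.sleep(0.05)
    before = monitor.is_connected  # Redis still down
    client.up = True               # simulate Redis coming back
    await asyncio.sleep(0.05)
    after = monitor.is_connected
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return before, after


states = asyncio.run(demo())
```

Because the loop owns the connectivity flag, request handlers only read it and never block on a reconnect attempt, which is in line with the "minimal performance impact from checks" constraint.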

Logging

# On disconnect
logger.warning("Redis unavailable, operating in degraded mode")

# On reconnect
logger.info("Redis connection restored, resuming normal operation")

# Per-operation (debug level)
logger.debug(f"Cache miss (degraded): {key}")
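
The disconnect and reconnect messages are most useful when they are emitted once per state change rather than once per failed operation. A minimal sketch of that idea follows; `ConnectionState`, `ListHandler`, and the logger name `resilient_cache` are hypothetical and exist only for this illustration.

```python
import logging

logger = logging.getLogger("resilient_cache")
logger.setLevel(logging.INFO)


class ConnectionState:
    """Tracks connectivity and logs only when the state actually changes."""

    def __init__(self) -> None:
        self.is_connected = True

    def mark_disconnected(self) -> None:
        if self.is_connected:
            self.is_connected = False
            logger.warning("Redis unavailable, operating in degraded mode")

    def mark_connected(self) -> None:
        if not self.is_connected:
            self.is_connected = True
            logger.info("Redis connection restored, resuming normal operation")


class ListHandler(logging.Handler):
    """Demo-only handler that collects log messages in a list."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)

state = ConnectionState()
state.mark_disconnected()
state.mark_disconnected()  # repeated failure: no second warning is logged
state.mark_connected()
```

With this guard, a burst of failed requests during an outage produces one warning instead of one per request, while the per-operation detail stays at debug level.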

Consequences

Positive

High Availability: Application works without Redis
Development Friendly: No mandatory Redis setup
Test Simplicity: Unit tests don't need Redis
Automatic Recovery: No manual intervention needed
Operational Visibility: Clear logs for degraded state

Negative

Performance Impact: Degraded mode slower (no cache)
Rate Limiting Bypass: Abuse protection weakened
Session Loss: Users may need to re-authenticate
Complexity: Every operation needs fallback logic

Neutral

Code Overhead: Additional try/except blocks
Monitoring Need: Must track degradation frequency

Alternatives Considered

1. Hard Dependency

Approach: Require Redis, fail startup without it
Rejected: Reduces availability, harder development

2. In-Memory Fallback

Approach: Use local memory cache when Redis down
Rejected: Inconsistent across instances, memory pressure

3. Secondary Redis

Approach: Failover to backup Redis
Rejected: Operational complexity, cost

Implementation Status

Implementation Details

Resilient Cache: backend/core/caching/cache_manager.py
Connection Health: backend/core/redis_manager.py
Rate Limit Bypass: backend/core/rate_limiting/
Session Fallback: backend/core/auth/session.py

Compliance/Validation

Automated checks: Tests run with and without Redis
Manual review: Degraded mode tested in staging
Metrics: Degraded operation count via Prometheus

LLM Council Review

Review Date: 2025-01-16
Confidence Level: High (100%)
Verdict: CONDITIONAL APPROVAL

Quality Metrics

Consensus Strength Score (CSS): 0.95
Deliberation Depth Index (DDI): 0.90

Council Feedback Summary

Graceful degradation is the correct strategy for an SRE platform (availability is paramount), but the specific fallback behaviors for Rate Limiting and Sessions are critical vulnerabilities.

Key Concerns Identified:

Rate Limit → Allow is Critical Security Flaw: Redis outage disables all throttling, enabling credential stuffing and DoS
Session → Re-auth Causes "Login Loop of Death": If sessions are stored only in Redis and Redis is down, users are logged out, their next login fails because the new session cannot be stored, and they are logged out again. This locks SREs out during outages.
Simple try/except is insufficient: Need Circuit Breaker pattern to prevent connection timeout hangs

Required Modifications:

Tiered Rate Limit Fallback:
- Sensitive Actions (Login, Admin): Fail Closed or strict local limits
- General Traffic: Local in-memory token bucket
Decouple Session from Redis:
- Preferred: Stateless JWTs (verify signature locally, Redis only for revocation)
- Alternative: Database fallback for session writes
Adopt Circuit Breaker Pattern: Immediate switch to local limits upon Redis failure
Refine Metrics: Distinguish "Health Check Failures" from "Degraded Operations"
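
The circuit breaker requirement can be illustrated with a small state holder that stops attempting Redis calls after repeated failures, so requests do not wait on connection timeouts. This is a sketch under assumptions, not the adopted implementation: the `CircuitBreaker` class, its thresholds, and the injectable `clock` parameter are all hypothetical.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Hypothetical breaker: opens after `max_failures` consecutive failures,
    then rejects calls until `reset_after` seconds have passed."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        """True if a Redis call may be attempted right now."""
        if self.opened_at is None:
            return True
        # After the cool-down, probe calls are allowed again;
        # a failed probe re-opens the breaker via record_failure().
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


# Demo with a fake clock so no real waiting is needed.
now = [0.0]
breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])

breaker.record_failure()
allowed_after_one = breaker.allow()   # still closed
breaker.record_failure()
allowed_after_two = breaker.allow()   # open: skip Redis, use the fallback
now[0] = 10.0
allowed_after_cooldown = breaker.allow()
breaker.record_success()
```

A caller would check `allow()` before each Redis operation and go straight to the local fallback when it returns False, reporting the outcome through `record_success()` or `record_failure()` otherwise.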

Modifications Applied

Documented Circuit Breaker requirement
Added in-memory rate limiting fallback
Documented stateless JWT session strategy
Updated metrics to track degradation types
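
The tiered in-memory rate limiting fallback might look like the sketch below. It is illustrative only: `LocalTokenBucket`, `SENSITIVE_ROUTES`, `degraded_allow`, the route names, and the bucket sizes are assumptions, and the sketch uses strict local limits (not fail-closed) for sensitive routes.

```python
import time
from typing import Callable


class LocalTokenBucket:
    """In-process token bucket, intended for use only while Redis is down."""

    def __init__(self, capacity: int, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical route tier; the real list would come from configuration.
SENSITIVE_ROUTES = {"/login", "/admin"}


def degraded_allow(route: str, general: LocalTokenBucket,
                   strict: LocalTokenBucket) -> bool:
    """Pick the strict bucket for sensitive routes, the general one otherwise."""
    bucket = strict if route in SENSITIVE_ROUTES else general
    return bucket.allow()


# Demo with a fake clock.
now = [0.0]
clock = lambda: now[0]
strict = LocalTokenBucket(capacity=1, refill_per_sec=0.0, clock=clock)
general = LocalTokenBucket(capacity=3, refill_per_sec=1.0, clock=clock)

login_results = [degraded_allow("/login", general, strict) for _ in range(2)]
api_results = [degraded_allow("/api/items", general, strict) for _ in range(4)]
now[0] = 1.0  # one second later, the general bucket has refilled one token
api_after_refill = degraded_allow("/api/items", general, strict)
```

Local buckets are per process, so with several application instances the effective limit is roughly the per-instance limit multiplied by the instance count. That is weaker than the Redis-backed limit but still bounded, unlike a full bypass.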

Council Ranking

All models reached consensus on security concerns
claude-opus-4.5: Best Response (session strategy)
gemini-3-pro: Strong (circuit breaker emphasis)

References

ADR-013 | Caching Layer | Implemented
