ADR-044: Health Checks

Status

Implemented

Date

2025-01-16 (Retrospective)

Decision Makers

  • SRE Team - Reliability requirements
  • DevOps Team - Deployment automation

Layer

Observability

Related ADRs

  • ADR-010: Docker Compose Development

Supersedes

None

Depends On

None

Context

Container orchestration needs health status:

  1. Liveness: Is the process alive?
  2. Readiness: Can it accept traffic?
  3. Startup: Has it finished initializing?
  4. Dependency Health: Are dependencies available?
  5. Detailed Status: Component-level health

Requirements:

  • Kubernetes-compatible probes
  • Fast liveness checks (<100ms)
  • Detailed readiness with dependencies
  • Machine and human readable
  • Authenticated and public endpoints

Decision

We implement comprehensive health check endpoints:

Key Design Decisions

  1. Three Probe Types: Liveness, readiness, startup
  2. Dependency Checks: Database, Redis, external services
  3. JSON Response: Structured health status
  4. Fast Path: Liveness without dependency checks
  5. Public Liveness: No auth for basic health

Endpoint Structure

GET /health          # Quick liveness check (public)
GET /health/ready    # Readiness with dependencies
GET /health/live     # Liveness probe
GET /health/startup  # Startup probe
GET /health/details  # Full component status (authenticated)

Response Formats

Liveness (GET /health):

{
  "status": "healthy",
  "timestamp": "2025-01-16T10:30:00Z"
}

Readiness (GET /health/ready):

{
  "status": "healthy",
  "checks": {
    "database": {"status": "healthy", "latency_ms": 5},
    "redis": {"status": "healthy", "latency_ms": 2},
    "migrations": {"status": "healthy", "version": "abc123"}
  }
}

Detailed (GET /health/details):

{
  "status": "healthy",
  "version": "1.0.0",
  "uptime_seconds": 86400,
  "checks": {
    "database": {
      "status": "healthy",
      "latency_ms": 5,
      "pool_size": 15,
      "pool_checked_out": 3
    },
    "redis": {
      "status": "healthy",
      "latency_ms": 2,
      "memory_mb": 128
    },
    "disk": {
      "status": "healthy",
      "available_gb": 50
    }
  }
}

Implementation

import time
from datetime import datetime, timezone

from fastapi import Depends
from sqlalchemy import text
from sqlalchemy.orm import Session

@router.get("/health")
async def health_check():
    """Quick liveness check."""
    return {"status": "healthy", "timestamp": datetime.now(timezone.utc).isoformat()}

@router.get("/health/ready")
async def readiness_check(db: Session = Depends(get_db)):
    """Readiness check with dependency verification."""
    checks = {}

    # Database check
    try:
        start = time.time()
        db.execute(text("SELECT 1"))
        checks["database"] = {
            "status": "healthy",
            "latency_ms": round((time.time() - start) * 1000, 2),
        }
    except Exception as e:
        checks["database"] = {"status": "unhealthy", "error": str(e)}

    # Redis check (assumes a module-level async `redis` client)
    try:
        start = time.time()
        await redis.ping()
        checks["redis"] = {
            "status": "healthy",
            "latency_ms": round((time.time() - start) * 1000, 2),
        }
    except Exception:
        checks["redis"] = {"status": "degraded", "error": "Redis unavailable"}

    overall = "healthy" if all(c["status"] == "healthy" for c in checks.values()) else "degraded"
    return {"status": overall, "checks": checks}
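
The dependency checks above have no hard deadline, so a hanging database or Redis connection can stall the readiness endpoint past the probe timeout. A minimal sketch of a timeout wrapper for such checks, assuming the dependency probe is any awaitable that raises on failure (the helper name and default timeout are illustrative, not part of the implementation):

```python
import asyncio
import time


async def timed_check(probe, timeout_s: float = 1.0) -> dict:
    """Run a dependency probe with a hard deadline and return a health entry.

    `probe` is any awaitable that raises on failure (e.g. a ping call).
    Keeping timeout_s below the orchestrator's probe timeout ensures the
    endpoint answers even when a dependency hangs.
    """
    start = time.monotonic()
    try:
        await asyncio.wait_for(probe, timeout=timeout_s)
        return {
            "status": "healthy",
            "latency_ms": round((time.monotonic() - start) * 1000, 2),
        }
    except asyncio.TimeoutError:
        return {"status": "unhealthy", "error": f"timed out after {timeout_s}s"}
    except Exception as e:
        return {"status": "unhealthy", "error": str(e)}
```

Each `try/except` block in the readiness handler could then collapse to a single `await timed_check(...)` call per dependency.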

Kubernetes Configuration

livenessProbe:
  httpGet:
    path: /health/live
    port: 8888
  initialDelaySeconds: 10
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8888
  initialDelaySeconds: 5
  periodSeconds: 5

startupProbe:
  httpGet:
    path: /health/startup
    port: 8888
  failureThreshold: 30
  periodSeconds: 10
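
The implementation notes reference healthcheck sections in docker-compose.yml; a minimal Compose sketch of the same liveness endpoint, where the service name and the availability of curl inside the container are assumptions:

```yaml
services:
  backend:                     # hypothetical service name
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8888/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 30s        # analogous to the startup probe's grace window
```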

Consequences

Positive

  • Orchestration Ready: K8s, Docker Compose compatible
  • Fast Liveness: Sub-100ms response
  • Degradation Visibility: Know which component failed
  • Automated Recovery: Failed probes trigger restart
  • Debugging: Detailed status for troubleshooting

Negative

  • Dependency Impact: Slow dependency = slow readiness
  • False Negatives: Transient failures trigger restart
  • Endpoint Exposure: Must secure detailed endpoint
  • Maintenance: Must update as dependencies change

Neutral

  • Probe Tuning: Thresholds need adjustment per environment
  • Cascading Failures: Dependency failure affects service

Implementation Status

  • Core implementation complete
  • Tests written and passing
  • Documentation updated
  • Migration/upgrade path defined
  • Monitoring/observability in place

Implementation Details

  • Endpoints: backend/api/v1/health.py
  • Database Health: backend/core/database.py:check_db_health()
  • Docker: docker-compose.yml (healthcheck sections)

LLM Council Review

Review Date: 2025-01-16
Confidence Level: High (100%)
Verdict: APPROVED

Quality Metrics

  • Consensus Strength Score (CSS): 0.95
  • Deliberation Depth Index (DDI): 0.88

Council Feedback Summary

Excellent design for Kubernetes deployment. The three-probe separation and dependency checks are well-structured. Minor gaps in HTTP status codes and cascading failure handling.

Key Concerns Identified:

  1. HTTP Status Codes: Kubernetes probes expect specific codes; current design may not differentiate correctly
  2. Cascading Failures: Dependency failure in readiness check could cause unnecessary restarts
  3. Deep Health Risk: Checking too many dependencies slows liveness response

Required Modifications:

  1. Correct HTTP Codes:
    • Liveness: Return 200 (healthy) or 503 (unhealthy)
    • Readiness: Return 200 (ready) or 503 (not ready)
    • Degraded should return 200 for liveness, 503 for readiness
  2. Cascading Failure Protection:
    • Liveness: NEVER check external dependencies
    • Readiness: Check dependencies but with circuit breaker timeouts
  3. Timeouts: Set dependency check timeouts shorter than probe timeout
  4. Status Levels: Add "degraded" status that doesn't fail liveness
  5. Probe Configuration Guidance:

     livenessProbe:
       initialDelaySeconds: 30  # Allow warm-up
       periodSeconds: 10
       timeoutSeconds: 5
       failureThreshold: 3      # Don't restart on single failure
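
The status-code rules in modifications 1 and 4 can be captured in one small mapping. A sketch under the council's stated semantics, with the function name being illustrative:

```python
def http_code(status: str, probe: str) -> int:
    """Map a health status to the HTTP code a probe endpoint should return.

    Per the council guidance: "degraded" passes liveness (200) but fails
    readiness (503); "unhealthy" fails both.
    """
    if status == "healthy":
        return 200
    if status == "degraded":
        return 200 if probe == "liveness" else 503
    return 503
```

In FastAPI this would typically be applied by returning a JSONResponse with `status_code=http_code(overall, probe)` instead of a bare dict.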

Modifications Applied

  1. Documented correct HTTP status code mapping
  2. Added cascading failure protection guidance
  3. Documented timeout configuration
  4. Added degraded status handling

Council Ranking

  • gpt-5.2: Best Response (HTTP codes)
  • gemini-3-pro: Strong (cascading failures)
  • claude-opus-4.5: Good (probe timing)

ADR-044 | Observability Layer | Implemented