ADR-044: Health Checks
Status
Implemented
Date
2025-01-16 (Retrospective)
Decision Makers
- SRE Team - Reliability requirements
- DevOps Team - Deployment automation
Layer
Observability
Related ADRs
- ADR-010: Docker Compose Development
Supersedes
None
Depends On
None
Context
Container orchestration needs health status:
- Liveness: Is the process alive?
- Readiness: Can it accept traffic?
- Startup: Has it finished initializing?
- Dependency Health: Are dependencies available?
- Detailed Status: Component-level health
Requirements:
- Kubernetes-compatible probes
- Fast liveness checks (<100ms)
- Detailed readiness with dependencies
- Machine and human readable
- Authenticated and public endpoints
Decision
We implement comprehensive health check endpoints:
Key Design Decisions
- Three Probe Types: Liveness, readiness, startup
- Dependency Checks: Database, Redis, external services
- JSON Response: Structured health status
- Fast Path: Liveness without dependency checks
- Public Liveness: No auth for basic health
Endpoint Structure
GET /health # Quick liveness check (public)
GET /health/ready # Readiness with dependencies
GET /health/live # Liveness probe
GET /health/startup # Startup probe
GET /health/details # Full component status (authenticated)
Response Formats
Liveness (GET /health):
{
  "status": "healthy",
  "timestamp": "2025-01-16T10:30:00Z"
}
Readiness (GET /health/ready):
{
  "status": "healthy",
  "checks": {
    "database": {"status": "healthy", "latency_ms": 5},
    "redis": {"status": "healthy", "latency_ms": 2},
    "migrations": {"status": "healthy", "version": "abc123"}
  }
}
Detailed (GET /health/details):
{
  "status": "healthy",
  "version": "1.0.0",
  "uptime_seconds": 86400,
  "checks": {
    "database": {
      "status": "healthy",
      "latency_ms": 5,
      "pool_size": 15,
      "pool_checked_out": 3
    },
    "redis": {
      "status": "healthy",
      "latency_ms": 2,
      "memory_mb": 128
    },
    "disk": {
      "status": "healthy",
      "available_gb": 50
    }
  }
}
Implementation
import time
from datetime import datetime, timezone

from fastapi import Depends, Response, status
from sqlalchemy import text
from sqlalchemy.orm import Session

@router.get("/health")
async def health_check():
    """Quick liveness check. Never touches external dependencies."""
    return {"status": "healthy", "timestamp": datetime.now(timezone.utc).isoformat()}

@router.get("/health/ready")
async def readiness_check(response: Response, db: Session = Depends(get_db)):
    """Readiness check with dependency verification."""
    checks = {}

    # Database check
    try:
        start = time.time()
        db.execute(text("SELECT 1"))
        checks["database"] = {
            "status": "healthy",
            "latency_ms": round((time.time() - start) * 1000, 2),
        }
    except Exception as e:
        checks["database"] = {"status": "unhealthy", "error": str(e)}

    # Redis check: degraded rather than unhealthy, since the service can
    # still serve some traffic without the cache
    try:
        start = time.time()
        await redis.ping()
        checks["redis"] = {
            "status": "healthy",
            "latency_ms": round((time.time() - start) * 1000, 2),
        }
    except Exception:
        checks["redis"] = {"status": "degraded", "error": "Redis unavailable"}

    overall = "healthy" if all(c["status"] == "healthy" for c in checks.values()) else "degraded"
    # Readiness must return 503 when not fully healthy so the orchestrator
    # stops routing traffic; liveness would still return 200.
    if overall != "healthy":
        response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE
    return {"status": overall, "checks": checks}
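The readiness checks run each dependency probe without a deadline, so a hung dependency can stall the whole endpoint past the probe's timeout. A minimal sketch of the timeout wrapper the council recommended (names are illustrative, assuming plain asyncio; not the codebase's actual helper):

```python
import asyncio

async def check_with_timeout(name: str, check, timeout_s: float = 2.0) -> dict:
    """Run one dependency check with a hard deadline so a hung dependency
    cannot stall the readiness probe. timeout_s should be shorter than the
    probe's own timeoutSeconds."""
    try:
        await asyncio.wait_for(check(), timeout=timeout_s)
        return {"status": "healthy"}
    except asyncio.TimeoutError:
        return {"status": "unhealthy", "error": f"{name} check timed out"}
    except Exception as exc:
        return {"status": "unhealthy", "error": str(exc)}

async def demo():
    async def hung_redis():
        # Simulates a dependency that never answers
        await asyncio.sleep(10)
    return await check_with_timeout("redis", hung_redis, timeout_s=0.05)

print(asyncio.run(demo()))  # → {'status': 'unhealthy', 'error': 'redis check timed out'}
```

Each dependency check in the readiness handler would be wrapped this way, keeping the sum of check timeouts below the probe's `timeoutSeconds`.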
Kubernetes Configuration
livenessProbe:
  httpGet:
    path: /health/live
    port: 8888
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8888
  initialDelaySeconds: 5
  periodSeconds: 5
startupProbe:
  httpGet:
    path: /health/startup
    port: 8888
  failureThreshold: 30
  periodSeconds: 10
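The Docker Compose environment from ADR-010 also consumes these endpoints (the healthcheck sections noted under Implementation Details). A hedged sketch of an equivalent Compose healthcheck, polling the public liveness endpoint (service name and values are illustrative):

```yaml
services:
  api:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8888/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 30s
```

`start_period` plays the role of the Kubernetes startup probe: failures during warm-up don't count against `retries`.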
Consequences
Positive
- Orchestration Ready: K8s, Docker Compose compatible
- Fast Liveness: Sub-100ms response
- Degradation Visibility: Know which component failed
- Automated Recovery: Failed probes trigger restart
- Debugging: Detailed status for troubleshooting
Negative
- Dependency Impact: Slow dependency = slow readiness
- False Negatives: Transient failures trigger restart
- Endpoint Exposure: Must secure detailed endpoint
- Maintenance: Must update as dependencies change
Neutral
- Probe Tuning: Thresholds need adjustment per environment
- Cascading Failures: Dependency failure affects service
Implementation Status
- Core implementation complete
- Tests written and passing
- Documentation updated
- Migration/upgrade path defined
- Monitoring/observability in place
Implementation Details
- Endpoints: backend/api/v1/health.py
- Database Health: backend/core/database.py:check_db_health()
- Docker: docker-compose.yml (healthcheck sections)
LLM Council Review
Review Date: 2025-01-16
Confidence Level: High (100%)
Verdict: APPROVED
Quality Metrics
- Consensus Strength Score (CSS): 0.95
- Deliberation Depth Index (DDI): 0.88
Council Feedback Summary
Excellent design for Kubernetes deployment. The three-probe separation and dependency checks are well-structured. Minor gaps in HTTP status codes and cascading failure handling.
Key Concerns Identified:
- HTTP Status Codes: Kubernetes probes expect specific codes; current design may not differentiate correctly
- Cascading Failures: Dependency failure in readiness check could cause unnecessary restarts
- Deep Health Risk: Checking too many dependencies slows liveness response
Required Modifications:
- Correct HTTP Codes:
- Liveness: Return 200 (healthy) or 503 (unhealthy)
- Readiness: Return 200 (ready) or 503 (not ready)
- Degraded should return 200 for liveness, 503 for readiness
- Cascading Failure Protection:
- Liveness: NEVER check external dependencies
- Readiness: Check dependencies but with circuit breaker timeouts
- Timeouts: Set dependency check timeouts shorter than probe timeout
- Status Levels: Add a "degraded" status that doesn't fail liveness
- Probe Configuration Guidance:
livenessProbe:
  initialDelaySeconds: 30  # Allow warm-up
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3      # Don't restart on single failure
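The status-to-HTTP-code mapping required above can be expressed as a pure function, independent of any web framework. A sketch (function and variable names are illustrative, not from the codebase):

```python
# Severity ordering: overall status is the worst component status.
SEVERITY = {"healthy": 0, "degraded": 1, "unhealthy": 2}

def aggregate_status(checks: dict) -> str:
    """Return the worst status across all component checks."""
    if not checks:
        return "healthy"
    return max((c["status"] for c in checks.values()), key=SEVERITY.__getitem__)

def http_code(status: str, probe: str) -> int:
    """Map an overall status to the HTTP code each probe type should return.

    "degraded" passes liveness (200, don't restart the pod) but fails
    readiness (503, stop routing traffic to it)."""
    if status == "healthy":
        return 200
    if status == "degraded":
        return 200 if probe == "liveness" else 503
    return 503

checks = {"database": {"status": "healthy"}, "redis": {"status": "degraded"}}
overall = aggregate_status(checks)
print(overall, http_code(overall, "liveness"), http_code(overall, "readiness"))
# → degraded 200 503
```

Keeping this mapping in one place makes it easy to unit-test the restart/routing behavior without spinning up the app.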
Modifications Applied
- Documented correct HTTP status code mapping
- Added cascading failure protection guidance
- Documented timeout configuration
- Added degraded status handling
Council Ranking
- gpt-5.2: Best Response (HTTP codes)
- gemini-3-pro: Strong (cascading failures)
- claude-opus-4.5: Good (probe timing)
References
ADR-044 | Observability Layer | Implemented