
ADR-029: Error Budget Alerting

Status

Implemented

Date

2025-01-16 (Retrospective)

Decision Makers

  • SRE Team - SLO/Error Budget design
  • Architecture Team - Alerting patterns

Layer

API


Supersedes

None

Depends On

  • ADR-023: Prometheus + OpenTelemetry Metrics

Context

SRE practices require proactive error budget management:

  1. SLO Tracking: Monitor service level objectives
  2. Error Budget: Track allowed failure margin
  3. Multi-Window Alerts: Detect both acute and chronic issues
  4. Burn Rate: Measure how fast budget is consumed
  5. Actionable Alerts: Meaningful notifications

Requirements:

  • Prevent error budget exhaustion
  • Different alert thresholds for different windows
  • Clear escalation paths
  • Integration with PagerDuty/OpsGenie

Decision

We implement multi-window SLO alerting based on Google SRE practices:

Key Design Decisions

  1. Multi-Window: Different alert windows (1h, 6h, 24h, 30d)
  2. Burn Rate: Alerts based on consumption rate
  3. Threshold Tiers: Critical, warning, info levels
  4. Error Budget API: Real-time budget calculations
  5. Prometheus Rules: AlertManager integration

Alert Windows and Thresholds

Window   | Burn Rate | Budget Consumed | Severity
---------|-----------|-----------------|---------
1 hour   | 14.4x     | 2%              | Critical
6 hours  | 6x        | 5%              | Critical
24 hours | 3x        | 10%             | Warning
3 days   | 1x        | 10%             | Warning
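The burn-rate and budget-consumed columns are linked: sustaining burn rate b for an alert window of w hours consumes b × w / 720 of a 30-day budget. A quick sketch verifying the table rows (assuming a 30-day SLO window, as in Google's SRE Workbook):

```python
SLO_WINDOW_HOURS = 30 * 24  # 720 h in a 30-day SLO window

def budget_consumed(burn_rate: float, window_hours: float) -> float:
    """Fraction of the 30-day error budget consumed if the given
    burn rate is sustained for the given alert window."""
    return burn_rate * window_hours / SLO_WINDOW_HOURS

print(f"{budget_consumed(14.4, 1):.0%}")  # 1 h  at 14.4x -> 2%
print(f"{budget_consumed(6, 6):.0%}")     # 6 h  at 6x    -> 5%
print(f"{budget_consumed(3, 24):.0%}")    # 24 h at 3x    -> 10%
print(f"{budget_consumed(1, 72):.0%}")    # 3 d  at 1x    -> 10%
```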

Error Budget Calculation

def calculate_error_budget(slo: SLO) -> ErrorBudgetStatus:
    """Calculate current error budget status."""
    # Get SLI measurements for window
    measurements = get_sli_measurements(slo.sli_id, slo.window)

    # Calculate actual performance
    successful = sum(1 for m in measurements if m.meets_threshold)
    total = len(measurements)
    actual_rate = successful / total if total > 0 else 1.0

    # Calculate budget
    target = slo.target_percentage / 100  # e.g., 0.999
    allowed_failures = 1 - target         # 0.001
    actual_failures = 1 - actual_rate

    budget_remaining = 1 - (actual_failures / allowed_failures)

    return ErrorBudgetStatus(
        budget_remaining_percent=budget_remaining * 100,
        consumed_percent=(1 - budget_remaining) * 100,
        is_exhausted=budget_remaining <= 0,
        burn_rate=calculate_burn_rate(actual_failures, allowed_failures),
    )
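The `calculate_burn_rate` helper is not shown in this ADR. A minimal sketch consistent with the standard definition (actual failure rate divided by the allowed failure rate over the same window) could be:

```python
def calculate_burn_rate(actual_failures: float, allowed_failures: float) -> float:
    """Burn rate: how fast the budget is consumed relative to plan.
    1.0 means consuming exactly the allowed budget over the SLO window;
    14.4 means the whole 30-day budget would be gone in ~50 hours.

    NOTE: a sketch only -- the actual helper is not shown in this ADR.
    """
    if allowed_failures <= 0:
        # A 100% SLO has no budget; any failure burns infinitely fast.
        return float("inf") if actual_failures > 0 else 0.0
    return actual_failures / allowed_failures

# Example: 0.15% failures against a 99.9% target (0.1% allowed) -> ~1.5x
print(calculate_burn_rate(0.0015, 0.001))
```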

API Endpoint

GET /api/v1/error-budgets/EB-000001/status

{
  "slo_id": "SLO-000001",
  "target": 99.9,
  "actual": 99.85,
  "budget_remaining_percent": 50.0,
  "consumed_percent": 50.0,
  "burn_rate": 1.5,
  "time_to_exhaustion_hours": 168,
  "alert_status": "warning"
}

Prometheus Alert Rules

groups:
- name: slo-alerts
  rules:
  - alert: ErrorBudgetBurnHigh
    expr: error_budget_burn_rate > 6
    for: 1h
    labels:
      severity: critical
    annotations:
      summary: "High error budget burn rate"
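This single-window rule is the style the council review below flags as prone to flapping. For comparison, a paired-window variant might look like the following sketch; the recording-rule names `slo:error_budget_burn_rate:1h` and `slo:error_budget_burn_rate:5m` are assumptions, and the recording rules themselves are not shown:

```yaml
groups:
- name: slo-alerts-paired
  rules:
  - alert: ErrorBudgetBurnCritical
    # Paired windows: BOTH the 1h and 5m burn rates must be hot, so
    # the alert fires quickly and resolves quickly once the issue ends.
    expr: >
      slo:error_budget_burn_rate:1h > 14.4
      and
      slo:error_budget_burn_rate:5m > 14.4
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Error budget burn > 14.4x on both 1h and 5m windows"
```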

Consequences

Positive

  • Proactive: Alerts before budget exhaustion
  • Actionable: Clear severity and escalation
  • SRE-Aligned: Google SRE best practices
  • Visualization: Dashboard integration
  • API-Driven: Programmatic access to status

Negative

  • Complexity: Multi-window logic is complex
  • Tuning Required: Thresholds need adjustment
  • Alert Fatigue: Over-alerting possible
  • Data Requirements: Needs consistent SLI data

Neutral

  • Integration: Works with any alert manager
  • Customization: Per-SLO threshold overrides possible

Implementation Status

  • Core implementation complete
  • Tests written and passing
  • Documentation updated
  • Migration/upgrade path defined
  • Monitoring/observability in place

Implementation Details

  • Service: backend/services/error_budget_service.py
  • API: backend/api/v1/error_budgets.py
  • Prometheus: monitoring/prometheus/alerts/
  • Dashboard: monitoring/grafana/dashboards/slo.json

LLM Council Review

Review Date: 2025-01-16
Confidence Level: High (100%)
Verdict: REQUEST CHANGES - CRITICAL GAPS

Quality Metrics

  • Consensus Strength Score (CSS): 0.85
  • Deliberation Depth Index (DDI): 0.92

Council Feedback Summary

The burn rate calculations align with Google SRE methodology but the implementation is mathematically incomplete. The critical "paired window" technique is missing, which will cause alert flapping and zombie alerts.

Key Concerns Identified:

  1. Single-Window Alerting Flaw: Alerting on a single window means alerts take 50-60 minutes to resolve after the issue ends
  2. Low-Traffic Problem: A single error in a low-volume service yields a 100% error rate → false Critical alert
  3. 3-Day 1x Burn Noise: Alerting on 1x burn rate = alerting for "consuming budget as planned" → constant noise
  4. Ambiguous API Response: {"burn_rate": 1.5} doesn't indicate which window

Required Modifications:

  1. Implement Paired Windows: IF (BurnRate_1h > 14.4) AND (BurnRate_5m > 14.4) THEN Alert
  2. Add Short Windows: Include 5m and 30m comparison windows
  3. Minimum Volume Gate: Require rate(requests) > X before alerting
  4. Downgrade 3-Day Severity: Make 3-day window Info/Dashboard only, not paging
  5. Enhanced API Response:
    {
      "window_burn_rates": {"5m": 0.0, "1h": 14.5, "24h": 1.2},
      "alert_active": true
    }
  6. Handle Data Gaps: Explicit handling for missing Prometheus scrapes (NaN)
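Modifications 1-3 can be combined into a single guard. A minimal sketch of that logic (names are illustrative, not the actual `MultiWindowAlertingService` API):

```python
def should_alert(burn_long: float, burn_short: float,
                 threshold: float, request_rate: float,
                 min_request_rate: float = 100.0) -> bool:
    """Paired-window alert check with a minimum volume gate.
    Illustrative sketch of modifications 1-3 above; not the
    actual service implementation."""
    # Volume gate: with too little traffic, burn rates are meaningless
    # (one error can look like a 100% failure rate).
    if request_rate < min_request_rate:
        return False
    # Paired windows: BOTH the long and the short window must exceed
    # the threshold, so the alert clears quickly once the issue ends.
    return burn_long > threshold and burn_short > threshold

print(should_alert(14.5, 15.0, 14.4, 500))  # True  -> page
print(should_alert(14.5, 0.2, 14.4, 500))   # False -> issue has ended
```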

Modifications Applied

  1. Documented paired window requirement for all alert tiers
  2. Added minimum traffic volume gates
  3. Documented severity downgrade for 3-day window
  4. Enhanced API response specification

Implementation (2025-01-16)

All council feedback has been addressed:

  1. Paired Windows Implemented (backend/services/multi_window_alerting_service.py)

    • check_alert(): BOTH short AND long windows must exceed threshold
    • Standard configs: 5m+1h, 30m+6h, 2h+24h, 6h+3d
  2. Minimum Volume Gate (MultiWindowAlertingService.check_alert_with_volume_gate())

    • Default threshold: 100 requests
    • Prevents false alerts on low-traffic services
  3. Incident Suppression (MultiWindowAlertingService.check_alert_with_incident_suppression())

    • Checks for active incidents before alerting
    • Prevents alert storms during known incidents
  4. Enhanced API Response (GET /api/v1/error-budgets/{id}/status)

    {
      "window_burn_rates": {"5m": {...}, "1h": {...}, "6h": {...}, "24h": {...}},
      "alert_active": true,
      "active_alerts": [{"severity": "page", "short_window": "5m", ...}]
    }
  5. Data Gap Handling: NaN values trigger incomplete_data suppression

  6. 3-Day Window Downgraded: Severity is "ticket" (info/dashboard), not "page"

Tests

  • backend/tests/integration/test_adr029_multi_window_alerting.py
  • Covers: paired windows, volume gate, incident suppression, API response

Council Ranking

  • gpt-5.2: Best Response (paired windows)
  • gemini-3-pro: Strong (edge cases)
  • claude-opus-4.5: Good (API design)

References

  • Google SRE Workbook, Chapter 5: Alerting on SLOs (multi-window, multi-burn-rate alerting)
  • ADR-023: Prometheus + OpenTelemetry Metrics

ADR-029 | API Layer | Implemented