ADR-013: AIOps - LLM-Powered Root Cause Analysis and Auto-Resolution
Status
Conditionally Approved: Phase 1 (Analysis Only) is approved; Phase 2+ requires addressing Council feedback.
LLM Council Review (2026-01-20)
Verdict: Conditional Approval (Phase 1 Only)
Key Findings:
- Raw LLM confidence scores are unreliable for automation decisions
- Cache clearing poses thundering herd risk - moved to Tier 3
- Pod restarts require precondition verification
- Post-action SLO verification is mandatory
Required Changes for Phase 2:
- Replace LLM confidence thresholds with policy-based scoring
- Add circuit breakers and remediation safety controller
- Implement precondition verification for all safe ops
- Add post-action SLO verification with automatic rollback
Date
2026-01-20
Context
Habit Hub uses Grafana Faro for Real User Monitoring (RUM) and OpenTelemetry for distributed tracing. Currently, when issues occur:
- Alerts fire based on static thresholds
- Engineers manually investigate traces, logs, and metrics
- Root cause analysis is time-consuming and requires domain expertise
- Resolution depends on engineer availability
Problem Statement: Can we leverage LLMs to automatically analyze observability data, identify root causes, and propose or execute fixes?
Current Observability Stack
```
Browser → Faro SDK → OTel Collector → Tempo        (traces)
                                    → Loki         (logs)
                                    → Prometheus   (metrics)
                                    → Grafana      (visualization)
                                    → AlertManager (alerting)
```
Opportunity
Modern LLMs can:
- Understand stack traces and error messages
- Correlate events across distributed systems
- Reason about code behavior given context
- Generate fixes based on patterns
Decision Drivers
- Mean Time to Resolution (MTTR): Reduce from hours to minutes
- On-Call Burden: Reduce toil for engineers
- Consistency: Apply same analysis rigor to all incidents
- Knowledge Capture: Encode institutional knowledge in prompts
- Safety: Prevent auto-resolution from causing more harm
- Cost: LLM API costs vs. engineer time savings
- Privacy: Ensure no PII leaks to LLM providers
Considered Options
Option A: Alert-Triggered Analysis (Reactive)
AlertManager → Webhook → AIOps Agent → LLM Analysis → Action
Flow:
- Grafana alert fires (error rate spike, latency threshold, etc.)
- Webhook triggers AIOps agent
- Agent gathers context (traces, logs, metrics, relevant code)
- LLM analyzes and produces root cause hypothesis
- Based on confidence, either:
  - Auto-apply a safe fix (config change, restart)
  - Create a PR for a code fix
  - Alert on-call with the analysis
Pros:
- Simple integration with existing alerting
- Only runs when needed (cost-efficient)
- Clear trigger → action model
Cons:
- Reactive, not predictive
- Alert fatigue can delay analysis
- Dependent on good alert configuration
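The alert-triggered entry point of Option A can be sketched as below. The payload shape follows Alertmanager's standard webhook format; the `InvestigationTrigger` fields are illustrative assumptions, not a fixed schema.

```python
from dataclasses import dataclass

@dataclass
class InvestigationTrigger:
    # Hypothetical trigger record handed to the AIOps agent
    alert_name: str
    severity: str
    fingerprint: str
    annotations: dict

def parse_alertmanager_webhook(payload: dict) -> list[InvestigationTrigger]:
    """Convert an Alertmanager webhook body into triggers, firing alerts only."""
    triggers = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts do not start an investigation
        labels = alert.get("labels", {})
        triggers.append(InvestigationTrigger(
            alert_name=labels.get("alertname", "unknown"),
            severity=labels.get("severity", "none"),
            fingerprint=alert.get("fingerprint", ""),
            annotations=alert.get("annotations", {}),
        ))
    return triggers
```

A webhook handler would call this on each POST from AlertManager and enqueue one investigation per trigger.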
Option B: Continuous Anomaly Detection (Proactive)
Telemetry Stream → ML Anomaly Detection → LLM Analysis → Action
Flow:
- ML model continuously analyzes telemetry streams
- Detects anomalies before they become incidents
- LLM investigates anomalies proactively
- Alerts or fixes before users are impacted
Pros:
- Catches issues before user impact
- Can detect subtle degradations
- Learns normal patterns over time
Cons:
- Higher compute/API costs (always running)
- More false positives initially
- Requires ML infrastructure (not just LLM)
Option C: Hybrid Approach (Recommended)
Combine reactive alerts with proactive pattern analysis:
```
                       AIOps Control Plane

 ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
 │   Reactive   │     │  Proactive   │     │  On-Demand   │
 │   (Alerts)   │     │  (Anomaly)   │     │   (Manual)   │
 └──────┬───────┘     └──────┬───────┘     └──────┬───────┘
        │                    │                    │
        └────────────────────┼────────────────────┘
                             ▼
                  ┌──────────────────┐
                  │ Context Gatherer │
                  │ - Traces (Tempo) │
                  │ - Logs (Loki)    │
                  │ - Metrics (Prom) │
                  │ - Code (GitHub)  │
                  │ - History (DB)   │
                  └────────┬─────────┘
                           ▼
                  ┌──────────────────┐
                  │   LLM Analyzer   │
                  │ - Root Cause     │
                  │ - Impact Scope   │
                  │ - Fix Proposal   │
                  │ - Confidence     │
                  └────────┬─────────┘
                           ▼
                  ┌──────────────────┐
                  │   LLM Council    │
                  │  (High Stakes)   │
                  └────────┬─────────┘
                           ▼
                  ┌──────────────────┐
                  │  Action Router   │
                  └────────┬─────────┘
                           │
         ┌─────────────────┼─────────────────┐
         ▼                 ▼                 ▼
 ┌────────────┐     ┌────────────┐     ┌────────────┐
 │ Auto-Apply │     │ Create PR  │     │   Alert    │
 │ (Safe Ops) │     │ (Code Fix) │     │  (Human)   │
 └────────────┘     └────────────┘     └────────────┘
```
Proposed Architecture
Component 1: Context Gatherer
Collects relevant data when an investigation is triggered:
```python
class ContextGatherer:
    async def gather(self, trigger: InvestigationTrigger) -> IncidentContext:
        return IncidentContext(
            # Observability data
            traces=await self.tempo.get_traces(trigger.trace_id, window="5m"),
            logs=await self.loki.query(trigger.log_query, window="10m"),
            metrics=await self.prometheus.query(trigger.metric_queries),
            # Code context
            stack_trace=trigger.stack_trace,
            relevant_code=await self.github.get_files(
                self.extract_files_from_trace(trigger.stack_trace)
            ),
            recent_commits=await self.github.get_commits(window="24h"),
            # Historical context
            similar_incidents=await self.incident_db.find_similar(trigger),
            runbooks=await self.runbook_db.find_relevant(trigger.error_type),
            # System state
            deployments=await self.argocd.get_recent_deployments(),
            config_changes=await self.config_db.get_recent_changes(),
        )
```
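The `extract_files_from_trace` helper referenced above could look like the following sketch for Python-style tracebacks. The stdlib-path filter is an assumption about what counts as application code.

```python
import re

def extract_files_from_trace(stack_trace: str) -> list[str]:
    """Return unique application file paths from 'File "...", line N' frames."""
    paths = re.findall(r'File "([^"]+)", line \d+', stack_trace)
    seen, unique = set(), []
    for path in paths:
        # Skip interpreter/stdlib frames; keep first occurrence of each file
        if path not in seen and not path.startswith("/usr/lib"):
            seen.add(path)
            unique.append(path)
    return unique
```

The resulting list feeds directly into `github.get_files(...)` in the gatherer.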
Component 2: LLM Analyzer
Analyzes context and produces structured output:
```python
class LLMAnalyzer:
    async def analyze(self, context: IncidentContext) -> Analysis:
        prompt = self.build_prompt(context)
        response = await self.llm.complete(
            model="claude-sonnet-4-20250514",  # Fast for initial analysis
            system=AIOPS_SYSTEM_PROMPT,
            messages=[{"role": "user", "content": prompt}],
            response_format=AnalysisSchema,  # Structured output
        )
        return Analysis(
            root_cause=response.root_cause,
            root_cause_confidence=response.confidence,
            impact_scope=response.impact_scope,
            affected_users_estimate=response.affected_users,
            proposed_fixes=response.fixes,
            fix_confidence=response.fix_confidence,
            requires_council_review=response.confidence < 0.8 or response.is_high_risk,
            supporting_evidence=response.evidence,
        )
```
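The `AnalysisSchema` passed as `response_format` could be a simple dataclass like the sketch below. Field names mirror the `Analysis` object above; the range check on `confidence` is an assumed validation rule, not a fixed contract.

```python
from dataclasses import dataclass, field

@dataclass
class AnalysisSchema:
    # Structured fields the model must fill in (names are illustrative)
    root_cause: str
    confidence: float
    impact_scope: str
    affected_users: int
    fixes: list = field(default_factory=list)
    fix_confidence: float = 0.0
    is_high_risk: bool = False
    evidence: list = field(default_factory=list)

    def __post_init__(self):
        # Reject out-of-range confidence values at parse time
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")
```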
Component 3: LLM Council Integration
For high-stakes decisions, use multi-model consensus:
```python
class AIOpsCouncil:
    async def review_fix(self, analysis: Analysis, fix: ProposedFix) -> CouncilVerdict:
        return await self.council.consult(
            query=f"""
## Incident Analysis
{analysis.to_markdown()}

## Proposed Fix
{fix.to_markdown()}

## Questions for Council
1. Is the root cause analysis correct?
2. Will the proposed fix resolve the issue?
3. What are the risks of applying this fix?
4. Should this be auto-applied or require human approval?
""",
            confidence="high",       # Full council deliberation
            verdict_type="binary",   # approve/reject
            include_dissent=True,
        )
```
Component 4: Action Router
Routes decisions based on policy scoring and deterministic verification (not raw LLM confidence):
```python
class ActionRouter:
    # Tier 1: Pre-approved safe operations (require precondition verification)
    SAFE_OPS = {
        "restart_pod": {
            "max_replicas": 1,
            "cooldown": "5m",
            "preconditions": ["not_during_deploy", "pod_healthy_recently", "no_pending_migrations"],
        },
        "scale_up": {
            "max_increase": 2,
            "max_total": 10,
            "preconditions": ["under_resource_quota", "no_pending_scale_ops"],
        },
        "toggle_feature_flag": {
            "flags": ["maintenance_mode"],
            "preconditions": ["flag_in_allowlist"],
        },
        "rollback_config": {
            "window": "1h",
            "preconditions": ["config_has_previous_version", "no_dependent_changes"],
        },
    }

    # Tier 3: Always require human approval (moved from Tier 1 per Council review)
    HUMAN_REQUIRED_OPS = {
        "clear_cache": {"reason": "Thundering herd risk - requires human judgment"},
        "database_migration": {"reason": "Data integrity risk"},
        "code_deployment": {"reason": "Broad impact"},
        "security_fix": {"reason": "Security-sensitive"},
        "data_modification": {"reason": "Data integrity risk"},
    }

    async def route(self, analysis: Analysis) -> ActionResult:
        for fix in analysis.proposed_fixes:
            # Calculate policy-based score (not raw LLM confidence)
            policy_score = await self.calculate_policy_score(analysis, fix)

            # Tier 1: Auto-apply safe operations with precondition verification
            if fix.type in self.SAFE_OPS:
                preconditions_met = await self.verify_preconditions(fix)
                if not preconditions_met:
                    return await self.alert_oncall(analysis, fix,
                                                   reason="Preconditions not met")
                if policy_score.auto_apply_eligible:
                    result = await self.auto_apply_with_circuit_breaker(fix)
                    # Post-action SLO verification with automatic rollback
                    await self.verify_slo_after_action(result, rollback_on_failure=True)
                    return result

            # Tier 2: Council review for code changes
            elif fix.type == "code_change":
                verdict = await self.council.review_fix(analysis, fix)
                if not verdict.approved:
                    return await self.alert_oncall(analysis, fix,
                                                   reason="Council rejected proposed fix")
                return await self.create_pr(fix, auto_merge=False,
                                            label="needs-human-review")

        # Tier 3: Alert humans for high-risk or unroutable operations
        return await self.alert_oncall(analysis, fix)

    async def calculate_policy_score(self, analysis: Analysis, fix: ProposedFix) -> PolicyScore:
        """
        Deterministic policy-based scoring instead of raw LLM confidence.
        Factors: incident type, fix type, recent failure history, system state.
        """
        return PolicyScore(
            auto_apply_eligible=(
                fix.type in self.SAFE_OPS
                and analysis.incident_type in ["high_latency", "error_spike", "resource_exhaustion"]
                and self.circuit_breaker.is_closed(fix.type)
                and await self.no_recent_failures(fix.type, window="1h")
            ),
            requires_council=fix.type == "code_change" or analysis.is_high_risk,
            requires_human=fix.type in self.HUMAN_REQUIRED_OPS,
        )

    async def auto_apply_with_circuit_breaker(self, fix: ProposedFix) -> ActionResult:
        """Apply fix with circuit breaker protection."""
        if self.circuit_breaker.is_open(fix.type):
            raise CircuitBreakerOpenError(f"Too many recent failures for {fix.type}")
        try:
            result = await self.auto_apply(fix)
            self.circuit_breaker.record_success(fix.type)
            return result
        except Exception:
            self.circuit_breaker.record_failure(fix.type)
            raise

    async def verify_slo_after_action(self, result: ActionResult, rollback_on_failure: bool):
        """Verify SLOs are met after the action; roll back if degraded."""
        await asyncio.sleep(30)  # Wait for metrics to stabilize
        slo_check = await self.prometheus.check_slos(result.affected_service)
        if not slo_check.passed and rollback_on_failure:
            await self.rollback(result)
            await self.alert_oncall(result.analysis, result.fix,
                                    reason="Auto-rollback: SLO degradation detected")
```
Component 5: Feedback Loop
Learn from outcomes to improve:
```python
class FeedbackCollector:
    async def record_outcome(self, incident_id: str, outcome: Outcome):
        await self.db.save({
            "incident_id": incident_id,
            "analysis": outcome.analysis,
            "action_taken": outcome.action,
            "resolution_time": outcome.resolution_time,
            "was_correct": outcome.human_feedback,  # Did the fix work?
            "false_positive": outcome.was_false_positive,
        })
        # Periodic refinement of prompts based on outcomes
        if await self.should_update_prompts():
            await self.update_prompts_from_feedback()
```
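The recorded outcomes can be rolled up into the accuracy and false-positive figures tracked in the Success Metrics section. A minimal sketch, assuming outcome records are dicts shaped like the ones `record_outcome` saves:

```python
def summarize_outcomes(outcomes: list[dict]) -> dict:
    """Compute root-cause accuracy (over human-reviewed incidents) and
    false-positive rate (over all incidents) from feedback records."""
    reviewed = [o for o in outcomes if o.get("was_correct") is not None]
    correct = sum(1 for o in reviewed if o["was_correct"])
    false_pos = sum(1 for o in outcomes if o.get("false_positive"))
    return {
        "root_cause_accuracy": correct / len(reviewed) if reviewed else None,
        "false_positive_rate": false_pos / len(outcomes) if outcomes else None,
    }
```

Incidents without human feedback are excluded from the accuracy denominator so unreviewed cases do not skew the metric.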
Auto-Resolution Safety Matrix
Updated per LLM Council Review: Cache clearing moved to Tier 3 due to thundering herd risk. All Tier 1 operations now require precondition verification and post-action SLO checks.
| Fix Type | Tier | Auto-Apply? | Preconditions | Rollback | SLO Check |
|---|---|---|---|---|---|
| Pod restart | 1 | Yes | Not during deploy, pod healthy recently, no migrations | N/A | Yes |
| Scale up | 1 | Yes | Under quota, no pending scale ops | Scale down | Yes |
| Config rollback | 1 | Yes | Previous version exists, no dependent changes | Re-apply | Yes |
| Feature flag off | 1 | Yes | Flag in allowlist | Re-enable | Yes |
| Cache clear | 3 | No | N/A - Requires human approval (thundering herd risk) | N/A | N/A |
| Code changes | 2 | No | Council review, PR created | Revert PR | N/A |
| Database migration | 3 | No | Always human approval | Manual | N/A |
| Security fix | 3 | No | Alert only, never auto-fix | N/A | N/A |
| Data modification | 3 | No | Never auto-apply | N/A | N/A |
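Precondition verification for the Tier 1 rows above could be wired as a name-to-check registry. The individual checks here are placeholders keyed by the precondition names in the SAFE_OPS table; unknown names fail closed.

```python
import asyncio

# Placeholder checks keyed by the precondition names used in SAFE_OPS.
# Each takes a system-state snapshot dict (shape is an assumption).
PRECONDITION_CHECKS = {
    "not_during_deploy": lambda ctx: not ctx.get("deploy_in_progress", False),
    "pod_healthy_recently": lambda ctx: ctx.get("healthy_minutes", 0) >= 10,
    "no_pending_migrations": lambda ctx: not ctx.get("pending_migrations"),
    "flag_in_allowlist": lambda ctx: ctx.get("flag") in ctx.get("allowlist", ()),
}

async def verify_preconditions(names: list[str], ctx: dict) -> tuple[bool, list[str]]:
    """Return (all_passed, failed_names); unknown preconditions fail closed."""
    failed = []
    for name in names:
        check = PRECONDITION_CHECKS.get(name)
        if check is None or not check(ctx):
            failed.append(name)
    return (not failed, failed)
```

Returning the failed names lets the on-call alert say exactly which guard blocked the auto-apply.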
Circuit Breaker Configuration
```python
CIRCUIT_BREAKER_CONFIG = {
    "restart_pod": {"failure_threshold": 3, "reset_timeout": "30m"},
    "scale_up": {"failure_threshold": 2, "reset_timeout": "1h"},
    "rollback_config": {"failure_threshold": 2, "reset_timeout": "1h"},
    "toggle_feature_flag": {"failure_threshold": 3, "reset_timeout": "30m"},
}
```
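A minimal in-memory breaker matching this config might look as follows: it opens after `failure_threshold` consecutive failures and allows a retry once the reset timeout has elapsed. Parsing duration strings like "30m" into seconds and persisting state across restarts are left out of this sketch.

```python
import time

class CircuitBreaker:
    """Per-operation breaker: open after N consecutive failures,
    half-open (one retry allowed) after the reset timeout."""

    def __init__(self, failure_threshold: int, reset_timeout_s: float):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at >= self.reset_timeout_s:
            # Timeout elapsed: move to half-open and permit one attempt
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def is_closed(self) -> bool:
        return not self.is_open()

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

The `ActionRouter` holds one instance per operation type, constructed from `CIRCUIT_BREAKER_CONFIG`.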
Privacy & Security Considerations
Data Sent to LLM
```python
class PIISanitizer:
    def sanitize(self, context: IncidentContext) -> SanitizedContext:
        return SanitizedContext(
            # Sanitize logs
            logs=self.redact_pii(context.logs),
            # Hash user identifiers
            user_ids=self.hash_ids(context.affected_users),
            # Remove secrets from stack traces
            stack_trace=self.redact_secrets(context.stack_trace),
            # Code is OK (already public or internal)
            code=context.relevant_code,
        )

PII_PATTERNS = [
    r'\b[\w.-]+@[\w.-]+\.\w+\b',   # Emails
    r'\b\d{3}-\d{2}-\d{4}\b',      # SSNs
    r'Bearer\s+[A-Za-z0-9-_]+',    # Bearer tokens
    r'password["\s:=]+\S+',        # Passwords
]
```
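A hypothetical `redact_pii` applying those patterns could replace each match with a typed placeholder, so analysts can still see what kind of value was removed. The labels and placeholder format are assumptions.

```python
import re

# Same regexes as PII_PATTERNS above, paired with placeholder labels
LABELED_PII_PATTERNS = [
    ("EMAIL", r'\b[\w.-]+@[\w.-]+\.\w+\b'),
    ("SSN", r'\b\d{3}-\d{2}-\d{4}\b'),
    ("TOKEN", r'Bearer\s+[A-Za-z0-9-_]+'),
    ("PASSWORD", r'password["\s:=]+\S+'),
]

def redact_pii(text: str) -> str:
    """Replace every PII match with a typed [REDACTED_*] placeholder."""
    for label, pattern in LABELED_PII_PATTERNS:
        text = re.sub(pattern, f"[REDACTED_{label}]", text)
    return text
```

Redaction must run before any context leaves the cluster for an LLM provider.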
Audit Trail
All AIOps actions are logged:
```python
@dataclass
class AIOpsAuditLog:
    timestamp: datetime
    trigger_type: str       # alert, anomaly, manual
    trigger_id: str
    context_hash: str       # Hash of data sent to LLM
    analysis_id: str
    llm_model: str
    council_verdict: Optional[CouncilVerdict]
    action_taken: str
    action_result: str
    human_override: Optional[str]
```
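One way the `context_hash` field could be computed is a SHA-256 over the sanitized context serialized as canonical JSON, so identical inputs always produce the same hash regardless of key order. The field selection is an assumption of this sketch.

```python
import hashlib
import json

def context_hash(sanitized_context: dict) -> str:
    """Stable SHA-256 hex digest of the sanitized context sent to the LLM."""
    canonical = json.dumps(sanitized_context, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Hashing (rather than storing) the context keeps the audit trail verifiable without retaining potentially sensitive payloads.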
Implementation Phases
Phase 1: Analysis Only (2 weeks) ✅ APPROVED
- Build Context Gatherer
- Implement LLM Analyzer with structured output
- Create Slack/Linear integration for analysis reports
- No auto-resolution, humans take action
- Risk: Low - read-only analysis
- Status: Approved by LLM Council
Phase 2: Safe Auto-Resolution (3 weeks) ⚠️ REQUIRES COUNCIL RE-REVIEW
Council Requirements: Must implement the following before Phase 2 deployment:
- Implement policy-based scoring (replace raw LLM confidence thresholds)
- Add precondition verification for all safe operations
- Implement circuit breakers per operation type
- Add post-action SLO verification with automatic rollback
- Enable pod restart and scale-up remediations (cache clear moved to Tier 3)
- Add comprehensive audit logging
- Create runbook for circuit breaker resets
- Risk: Medium - limited blast radius with safety controls
- Status: Blocked until safety controls implemented
Phase 3: Council-Reviewed Code Fixes (3 weeks)
- Integrate LLM Council for fix review
- Auto-create PRs with proposed fixes
- Add feedback collection
- Human approval required for merge
- Risk: Medium - humans in the loop
Phase 4: Proactive Anomaly Detection (4 weeks)
- Add ML-based anomaly detection
- Trigger analysis before alerts fire
- Learn baseline patterns per service
- Reduce false positives over time
- Risk: Medium - may generate noise initially
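As a toy illustration of the Phase 4 trigger shape (not the ML model itself), a baseline z-score check flags a sample that deviates too far from recent history. The threshold and window are assumptions.

```python
import statistics

def is_anomalous(baseline: list[float], sample: float, z_threshold: float = 3.0) -> bool:
    """Flag `sample` if it sits more than z_threshold standard deviations
    from the mean of the recent baseline window."""
    if len(baseline) < 2:
        return False  # not enough history to judge
    mean = statistics.fmean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        return sample != mean  # flat baseline: any deviation is anomalous
    return abs(sample - mean) / stdev > z_threshold
```

A real deployment would learn per-service baselines with seasonality handling; this only shows how an anomaly signal would feed the LLM investigation pipeline.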
Cost Analysis
LLM API Costs (Estimated)
| Scenario | Incidents/Month | Tokens/Incident | Cost/Month |
|---|---|---|---|
| Low volume | 50 | 50,000 | ~$50 |
| Medium volume | 200 | 50,000 | ~$200 |
| High volume | 1,000 | 50,000 | ~$1,000 |
ROI Calculation
| Metric | Before AIOps | After AIOps | Savings |
|---|---|---|---|
| MTTR | 2 hours | 15 minutes | 87% |
| On-call pages/week | 20 | 5 | 75% |
| Engineer hours/incident | 3 | 0.5 | 83% |
Break-even: ~10 incidents/month at $100/engineer-hour
Success Metrics
| Metric | Target | Measurement |
|---|---|---|
| Root cause accuracy | >80% | Human validation sampling |
| Auto-resolution success | >95% | Fix effectiveness tracking |
| MTTR reduction | >50% | Incident timeline analysis |
| False positive rate | <10% | Manual review of analyses |
| Engineer satisfaction | >4/5 | Quarterly survey |
Decision
Recommended: Option C (Hybrid Approach), implemented in phases starting with analysis-only.
Rationale
- Low risk start: Phase 1 is read-only, no production impact
- Incremental value: Each phase delivers measurable improvement
- Human oversight: Council review for high-stakes decisions
- Learning system: Feedback loop improves over time
- Cost-effective: Only runs when needed (alert-triggered)
Open Questions - Council Responses
These questions were reviewed by the LLM Council on 2026-01-20.
1. Confidence thresholds: Are 0.9 (auto-apply) and 0.8 (council review) appropriate?
   Council Response: ❌ Raw LLM confidence scores are unreliable for automation decisions. Resolution: Replace with policy-based scoring using deterministic factors (incident type, fix type, recent failure history, system state).
2. Safe operations list: Should any operations be added/removed from auto-apply?
   Council Response: ❌ Remove `clear_cache` - thundering herd risk makes it unsafe for auto-apply. Resolution: Cache clearing moved to Tier 3 (human required); precondition verification added for the remaining Tier 1 ops.
3. Privacy concerns: Is the PII sanitization approach sufficient?
   Council Response: ✅ Acceptable for Phase 1. Consider additional patterns for API keys and OAuth tokens.
4. Cost vs. value: Is the ROI analysis realistic?
   Council Response: ✅ ROI estimates are reasonable. Phase 1 (analysis-only) provides value with minimal cost.
5. Failure modes: What happens if the LLM gives bad advice consistently?
   Council Response: ⚠️ Add circuit breakers to prevent cascading failures from bad recommendations. Resolution: Circuit breaker pattern implemented with failure thresholds and automatic disable.
References
- Grafana Faro Documentation
- OpenTelemetry Specification
- Google SRE Book - Incident Response
- ADR-012: Security Scanning Strategy
- OBSERVABILITY_IMPLEMENTATION.md