ADR-013: AIOps - LLM-Powered Root Cause Analysis and Auto-Resolution

Status

Conditionally Approved - Phase 1 (Analysis Only) approved. Phase 2+ requires addressing Council feedback.

LLM Council Review (2026-01-20)

Verdict: Conditional Approval (Phase 1 Only)

Key Findings:

  • Raw LLM confidence scores are unreliable for automation decisions
  • Cache clearing poses thundering herd risk - moved to Tier 3
  • Pod restarts require precondition verification
  • Post-action SLO verification is mandatory

Required Changes for Phase 2:

  1. Replace LLM confidence thresholds with policy-based scoring
  2. Add circuit breakers and remediation safety controller
  3. Implement precondition verification for all safe ops
  4. Add post-action SLO verification with automatic rollback

Date

2026-01-20

Context

Habit Hub uses Grafana Faro for Real User Monitoring (RUM) and OpenTelemetry for distributed tracing. Currently, when issues occur:

  1. Alerts fire based on static thresholds
  2. Engineers manually investigate traces, logs, and metrics
  3. Root cause analysis is time-consuming and requires domain expertise
  4. Resolution depends on engineer availability

Problem Statement: Can we leverage LLMs to automatically analyze observability data, identify root causes, and propose or execute fixes?

Current Observability Stack

Browser → Faro SDK → OTel Collector → Tempo (traces)
                                    → Loki (logs)
                                    → Prometheus (metrics)
                                    → Grafana (visualization)
                                    → AlertManager (alerting)

Opportunity

Modern LLMs can:

  • Understand stack traces and error messages
  • Correlate events across distributed systems
  • Reason about code behavior given context
  • Generate fixes based on patterns

Decision Drivers

  1. Mean Time to Resolution (MTTR): Reduce from hours to minutes
  2. On-Call Burden: Reduce toil for engineers
  3. Consistency: Apply same analysis rigor to all incidents
  4. Knowledge Capture: Encode institutional knowledge in prompts
  5. Safety: Prevent auto-resolution from causing more harm
  6. Cost: LLM API costs vs. engineer time savings
  7. Privacy: Ensure no PII leaks to LLM providers

Considered Options

Option A: Alert-Triggered Analysis (Reactive)

AlertManager → Webhook → AIOps Agent → LLM Analysis → Action

Flow:

  1. Grafana alert fires (error rate spike, latency threshold, etc.)
  2. Webhook triggers AIOps agent
  3. Agent gathers context (traces, logs, metrics, relevant code)
  4. LLM analyzes and produces root cause hypothesis
  5. Based on confidence, either:
    • Auto-apply safe fix (config change, restart)
    • Create PR for code fix
    • Alert on-call with analysis
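
The alert-to-trigger step in this flow can be sketched as a small webhook handler. This is a minimal illustration assuming Alertmanager's standard webhook JSON payload; `InvestigationTrigger` and its fields are hypothetical names for this document, not an existing API.

```python
from dataclasses import dataclass, field


@dataclass
class InvestigationTrigger:
    """Hypothetical trigger object handed to the AIOps agent."""
    alert_name: str
    severity: str
    labels: dict = field(default_factory=dict)
    annotations: dict = field(default_factory=dict)


def triggers_from_webhook(payload: dict) -> list[InvestigationTrigger]:
    """Convert an Alertmanager webhook payload into investigation triggers.

    Only firing alerts start an investigation; resolved alerts are ignored.
    """
    return [
        InvestigationTrigger(
            alert_name=a["labels"].get("alertname", "unknown"),
            severity=a["labels"].get("severity", "none"),
            labels=a["labels"],
            annotations=a.get("annotations", {}),
        )
        for a in payload.get("alerts", [])
        if a.get("status") == "firing"
    ]
```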

Pros:

  • Simple integration with existing alerting
  • Only runs when needed (cost-efficient)
  • Clear trigger → action model

Cons:

  • Reactive, not predictive
  • Alert fatigue can delay analysis
  • Dependent on good alert configuration

Option B: Continuous Anomaly Detection (Proactive)

Telemetry Stream → ML Anomaly Detection → LLM Analysis → Action

Flow:

  1. ML model continuously analyzes telemetry streams
  2. Detects anomalies before they become incidents
  3. LLM investigates anomalies proactively
  4. Alerts or fixes before users are impacted

Pros:

  • Catches issues before user impact
  • Can detect subtle degradations
  • Learns normal patterns over time

Cons:

  • Higher compute/API costs (always running)
  • More false positives initially
  • Requires ML infrastructure (not just LLM)

Option C: Hybrid Approach (Reactive + Proactive)

Combine reactive alerts with proactive pattern analysis:

┌─────────────────────────────────────────────────────────────────┐
│ AIOps Control Plane │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Reactive │ │ Proactive │ │ On-Demand │ │
│ │ (Alerts) │ │ (Anomaly) │ │ (Manual) │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ └───────────────────┼───────────────────┘ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Context Gatherer │ │
│ │ - Traces (Tempo) │ │
│ │ - Logs (Loki) │ │
│ │ - Metrics (Prom) │ │
│ │ - Code (GitHub) │ │
│ │ - History (DB) │ │
│ └────────┬─────────┘ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ LLM Analyzer │ │
│ │ - Root Cause │ │
│ │ - Impact Scope │ │
│ │ - Fix Proposal │ │
│ │ - Confidence │ │
│ └────────┬─────────┘ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ LLM Council │ │
│ │ (High Stakes) │ │
│ └────────┬─────────┘ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Action Router │ │
│ └────────┬─────────┘ │
│ │ │
│ ┌──────────────────┼──────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Auto-Apply │ │ Create PR │ │ Alert │ │
│ │ (Safe Ops) │ │ (Code Fix) │ │ (Human) │ │
│ └────────────┘ └────────────┘ └────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘

Proposed Architecture

Component 1: Context Gatherer

Collects relevant data when an investigation is triggered:

class ContextGatherer:
    async def gather(self, trigger: InvestigationTrigger) -> IncidentContext:
        return IncidentContext(
            # Observability data
            traces=await self.tempo.get_traces(trigger.trace_id, window="5m"),
            logs=await self.loki.query(trigger.log_query, window="10m"),
            metrics=await self.prometheus.query(trigger.metric_queries),

            # Code context
            stack_trace=trigger.stack_trace,
            relevant_code=await self.github.get_files(
                self.extract_files_from_trace(trigger.stack_trace)
            ),
            recent_commits=await self.github.get_commits(window="24h"),

            # Historical context
            similar_incidents=await self.incident_db.find_similar(trigger),
            runbooks=await self.runbook_db.find_relevant(trigger.error_type),

            # System state
            deployments=await self.argocd.get_recent_deployments(),
            config_changes=await self.config_db.get_recent_changes(),
        )
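
The `extract_files_from_trace` helper referenced above is not specified. One plausible sketch, assuming Python-style tracebacks (an assumption — Faro-originated incidents may carry JS stack traces, which would need a different pattern):

```python
import re


def extract_files_from_trace(stack_trace: str) -> list[str]:
    """Pull unique file paths out of a Python-style traceback, in order
    of first appearance, so the most relevant files are fetched first."""
    paths = re.findall(r'File "([^"]+)", line \d+', stack_trace)
    seen: set[str] = set()
    ordered: list[str] = []
    for p in paths:
        if p not in seen:
            seen.add(p)
            ordered.append(p)
    return ordered
```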

Component 2: LLM Analyzer

Analyzes context and produces structured output:

class LLMAnalyzer:
    async def analyze(self, context: IncidentContext) -> Analysis:
        prompt = self.build_prompt(context)

        response = await self.llm.complete(
            model="claude-sonnet-4-20250514",  # Fast for initial analysis
            system=AIOPS_SYSTEM_PROMPT,
            messages=[{"role": "user", "content": prompt}],
            response_format=AnalysisSchema,  # Structured output
        )

        return Analysis(
            root_cause=response.root_cause,
            root_cause_confidence=response.confidence,
            impact_scope=response.impact_scope,
            affected_users_estimate=response.affected_users,
            proposed_fixes=response.fixes,
            fix_confidence=response.fix_confidence,
            requires_council_review=response.confidence < 0.8 or response.is_high_risk,
            supporting_evidence=response.evidence,
        )
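
`AnalysisSchema` implies the response is validated before it can drive any routing decision. A minimal stdlib sketch of that parsing step (field names mirror a subset of the `Analysis` object; in practice this would likely be a Pydantic model):

```python
import json
from dataclasses import dataclass


@dataclass
class ParsedAnalysis:
    root_cause: str
    confidence: float
    is_high_risk: bool


def parse_analysis(raw: str) -> ParsedAnalysis:
    """Validate the LLM's JSON response; reject out-of-range confidence
    values rather than letting malformed output reach the Action Router."""
    data = json.loads(raw)
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return ParsedAnalysis(
        root_cause=str(data["root_cause"]),
        confidence=confidence,
        is_high_risk=bool(data.get("is_high_risk", False)),
    )
```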

Component 3: LLM Council Integration

For high-stakes decisions, use multi-model consensus:

class AIOpsCouncil:
    async def review_fix(self, analysis: Analysis, fix: ProposedFix) -> CouncilVerdict:
        return await self.council.consult(
            query=f"""
## Incident Analysis
{analysis.to_markdown()}

## Proposed Fix
{fix.to_markdown()}

## Questions for Council
1. Is the root cause analysis correct?
2. Will the proposed fix resolve the issue?
3. What are the risks of applying this fix?
4. Should this be auto-applied or require human approval?
""",
            confidence="high",      # Full council deliberation
            verdict_type="binary",  # approve/reject
            include_dissent=True,
        )
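
`consult()` is the existing LLM Council interface; how a binary verdict with dissent might be aggregated from individual model votes is sketched below. This is an illustration, not the Council's actual implementation — `Verdict` and the vote shape are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    approved: bool
    dissent: list[str]  # models that voted against the majority


def aggregate_binary_verdict(votes: dict[str, bool]) -> Verdict:
    """Strict majority vote across council models.

    Ties fail closed (not approved), and dissenting models are recorded
    so their objections can be surfaced to the on-call engineer.
    """
    approvals = sum(votes.values())
    approved = approvals > len(votes) / 2
    dissent = sorted(m for m, v in votes.items() if v != approved)
    return Verdict(approved=approved, dissent=dissent)
```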

Component 4: Action Router

Routes decisions based on policy scoring and deterministic verification (not raw LLM confidence):

class ActionRouter:
    # Tier 1: Pre-approved safe operations (require precondition verification)
    SAFE_OPS = {
        "restart_pod": {
            "max_replicas": 1,
            "cooldown": "5m",
            "preconditions": ["not_during_deploy", "pod_healthy_recently", "no_pending_migrations"],
        },
        "scale_up": {
            "max_increase": 2,
            "max_total": 10,
            "preconditions": ["under_resource_quota", "no_pending_scale_ops"],
        },
        "toggle_feature_flag": {
            "flags": ["maintenance_mode"],
            "preconditions": ["flag_in_allowlist"],
        },
        "rollback_config": {
            "window": "1h",
            "preconditions": ["config_has_previous_version", "no_dependent_changes"],
        },
    }

    # Tier 3: Always require human approval (moved from Tier 1 per Council review)
    HUMAN_REQUIRED_OPS = {
        "clear_cache": {"reason": "Thundering herd risk - requires human judgment"},
        "database_migration": {"reason": "Data integrity risk"},
        "code_deployment": {"reason": "Broad impact"},
        "security_fix": {"reason": "Security-sensitive"},
        "data_modification": {"reason": "Data integrity risk"},
    }

    async def route(self, analysis: Analysis) -> ActionResult:
        for fix in analysis.proposed_fixes:
            # Calculate policy-based score (not raw LLM confidence)
            policy_score = await self.calculate_policy_score(analysis, fix)

            # Tier 1: Auto-apply safe operations with precondition verification
            if fix.type in self.SAFE_OPS:
                preconditions_met = await self.verify_preconditions(fix)
                if not preconditions_met:
                    return await self.alert_oncall(
                        analysis, fix, reason="Preconditions not met"
                    )

                if policy_score.auto_apply_eligible:
                    result = await self.auto_apply_with_circuit_breaker(fix)
                    # Post-action SLO verification
                    await self.verify_slo_after_action(result, rollback_on_failure=True)
                    return result

            # Tier 2: Council review for code changes
            elif fix.type == "code_change":
                verdict = await self.council.review_fix(analysis, fix)
                if verdict.approved:
                    return await self.create_pr(
                        fix, auto_merge=False, label="needs-human-review"
                    )

            # Tier 3: Alert humans for high-risk operations
            return await self.alert_oncall(analysis, fix)

    async def calculate_policy_score(self, analysis: Analysis, fix: ProposedFix) -> PolicyScore:
        """
        Deterministic policy-based scoring instead of raw LLM confidence.
        Factors: incident type, fix type, recent failure history, system state.
        """
        return PolicyScore(
            auto_apply_eligible=(
                fix.type in self.SAFE_OPS
                and analysis.incident_type in ["high_latency", "error_spike", "resource_exhaustion"]
                and self.circuit_breaker.is_closed(fix.type)
                and await self.no_recent_failures(fix.type, window="1h")
            ),
            requires_council=fix.type == "code_change" or analysis.is_high_risk,
            requires_human=fix.type in self.HUMAN_REQUIRED_OPS,
        )

    async def auto_apply_with_circuit_breaker(self, fix: ProposedFix) -> ActionResult:
        """Apply fix with circuit breaker protection."""
        if self.circuit_breaker.is_open(fix.type):
            raise CircuitBreakerOpenError(f"Too many recent failures for {fix.type}")

        try:
            result = await self.auto_apply(fix)
            self.circuit_breaker.record_success(fix.type)
            return result
        except Exception:
            self.circuit_breaker.record_failure(fix.type)
            raise

    async def verify_slo_after_action(self, result: ActionResult, rollback_on_failure: bool):
        """Verify SLOs are met after the action; roll back if not."""
        await asyncio.sleep(30)  # Wait for metrics to stabilize
        slo_check = await self.prometheus.check_slos(result.affected_service)

        if not slo_check.passed and rollback_on_failure:
            await self.rollback(result)
            await self.alert_oncall(
                result.analysis, result.fix,
                reason="Auto-rollback: SLO degradation detected",
            )

Component 5: Feedback Loop

Learn from outcomes to improve:

class FeedbackCollector:
    async def record_outcome(self, incident_id: str, outcome: Outcome):
        await self.db.save({
            "incident_id": incident_id,
            "analysis": outcome.analysis,
            "action_taken": outcome.action,
            "resolution_time": outcome.resolution_time,
            "was_correct": outcome.human_feedback,  # Did the fix work?
            "false_positive": outcome.was_false_positive,
        })

        # Periodic retraining of prompts based on outcomes
        if await self.should_update_prompts():
            await self.update_prompts_from_feedback()
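
`should_update_prompts` is left unspecified above. One deterministic sketch is a rolling-accuracy threshold over the most recent recorded outcomes; the window size and accuracy threshold below are illustrative assumptions, not settled values.

```python
from collections import deque


class OutcomeTracker:
    """Decide when to revise prompts: trigger once accuracy over the
    last N recorded outcomes falls below a threshold."""

    def __init__(self, window: int = 50, min_accuracy: float = 0.8):
        self.outcomes: deque[bool] = deque(maxlen=window)
        self.min_accuracy = min_accuracy

    def record(self, was_correct: bool) -> None:
        self.outcomes.append(was_correct)

    def should_update_prompts(self) -> bool:
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough data for a stable estimate yet
        accuracy = sum(self.outcomes) / len(self.outcomes)
        return accuracy < self.min_accuracy
```

Requiring a full window before triggering avoids thrashing prompt updates off a handful of early incidents.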

Auto-Resolution Safety Matrix

Updated per LLM Council Review: Cache clearing moved to Tier 3 due to thundering herd risk. All Tier 1 operations now require precondition verification and post-action SLO checks.

| Fix Type | Tier | Auto-Apply? | Preconditions | Rollback | SLO Check |
|---|---|---|---|---|---|
| Pod restart | 1 | Yes | Not during deploy, pod healthy recently, no migrations | N/A | Yes |
| Scale up | 1 | Yes | Under quota, no pending scale ops | Scale down | Yes |
| Config rollback | 1 | Yes | Previous version exists, no dependent changes | Re-apply | Yes |
| Feature flag off | 1 | Yes | Flag in allowlist | Re-enable | Yes |
| Cache clear | 3 | No | N/A - requires human approval (thundering herd risk) | N/A | N/A |
| Code changes | 2 | No | Council review, PR created | Revert PR | N/A |
| Database migration | 3 | No | Always human approval | Manual | N/A |
| Security fix | 3 | No | Alert only, never auto-fix | N/A | N/A |
| Data modification | 3 | No | Never auto-apply | N/A | N/A |
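
The precondition names in the matrix can map to deterministic check functions via a registry, so `verify_preconditions` never depends on LLM output. This is a sketch under stated assumptions — the decorator, registry, and stubbed checks are hypothetical, and unknown precondition names deliberately fail closed.

```python
from typing import Callable

# Registry mapping precondition names from the safety matrix to checks.
PRECONDITION_CHECKS: dict[str, Callable[[], bool]] = {}


def precondition(name: str):
    """Register a named, deterministic precondition check."""
    def register(fn: Callable[[], bool]) -> Callable[[], bool]:
        PRECONDITION_CHECKS[name] = fn
        return fn
    return register


@precondition("not_during_deploy")
def not_during_deploy() -> bool:
    return True  # stub: would query ArgoCD for in-flight syncs


@precondition("pod_healthy_recently")
def pod_healthy_recently() -> bool:
    return True  # stub: would check readiness history in Prometheus


def verify_preconditions(names: list[str]) -> bool:
    """Every named check must exist and pass; unknown names fail closed."""
    return all(PRECONDITION_CHECKS.get(n, lambda: False)() for n in names)
```

Failing closed on unknown names means a typo in a `SAFE_OPS` entry degrades to a human alert rather than an unguarded auto-apply.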

Circuit Breaker Configuration

CIRCUIT_BREAKER_CONFIG = {
    "restart_pod": {"failure_threshold": 3, "reset_timeout": "30m"},
    "scale_up": {"failure_threshold": 2, "reset_timeout": "1h"},
    "rollback_config": {"failure_threshold": 2, "reset_timeout": "1h"},
    "toggle_feature_flag": {"failure_threshold": 3, "reset_timeout": "30m"},
}
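
A minimal breaker matching this configuration might look like the following sketch; timeouts are taken as seconds here rather than parsed from the `"30m"`/`"1h"` strings, and one instance would be held per operation type.

```python
import time
from typing import Optional


class CircuitBreaker:
    """Per-operation breaker: opens after N consecutive failures and
    closes again once the reset timeout has elapsed."""

    def __init__(self, failure_threshold: int, reset_timeout_s: float):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at >= self.reset_timeout_s:
            # Reset timeout elapsed: close and allow operations again.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def is_closed(self) -> bool:
        return not self.is_open()

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```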

Privacy & Security Considerations

Data Sent to LLM

class PIISanitizer:
    def sanitize(self, context: IncidentContext) -> SanitizedContext:
        return SanitizedContext(
            # Sanitize logs
            logs=self.redact_pii(context.logs),

            # Hash user identifiers
            user_ids=self.hash_ids(context.affected_users),

            # Remove secrets from stack traces
            stack_trace=self.redact_secrets(context.stack_trace),

            # Code is OK (already public or internal)
            code=context.relevant_code,
        )

PII_PATTERNS = [
    r'\b[\w.-]+@[\w.-]+\.\w+\b',  # Emails
    r'\b\d{3}-\d{2}-\d{4}\b',     # SSNs
    r'Bearer\s+[A-Za-z0-9-_]+',   # Tokens
    r'password["\s:=]+\S+',       # Passwords
]
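
The `redact_pii` step can apply these patterns directly with `re.sub`; a minimal sketch:

```python
import re

PII_PATTERNS = [
    r'\b[\w.-]+@[\w.-]+\.\w+\b',  # Emails
    r'\b\d{3}-\d{2}-\d{4}\b',     # SSNs
    r'Bearer\s+[A-Za-z0-9-_]+',   # Tokens
    r'password["\s:=]+\S+',       # Passwords
]


def redact_pii(text: str) -> str:
    """Replace every PII pattern match with a marker before the text
    leaves the cluster for an LLM provider."""
    for pattern in PII_PATTERNS:
        text = re.sub(pattern, "[REDACTED]", text)
    return text
```

Per the Council's note, additional patterns for API keys and OAuth tokens would extend this list without changing the mechanism.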

Audit Trail

All AIOps actions are logged:

@dataclass
class AIOpsAuditLog:
    timestamp: datetime
    trigger_type: str  # alert, anomaly, manual
    trigger_id: str
    context_hash: str  # Hash of data sent to LLM
    analysis_id: str
    llm_model: str
    council_verdict: Optional[CouncilVerdict]
    action_taken: str
    action_result: str
    human_override: Optional[str]
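
The `context_hash` field can be computed deterministically over the sanitized payload, so an audit entry can later attest exactly what data was sent to the provider. A sketch (the canonical-JSON choice is an assumption, not a mandated format):

```python
import hashlib
import json


def context_hash(sanitized_context: dict) -> str:
    """Stable SHA-256 over a canonical serialization of the payload
    sent to the LLM; identical contexts always yield identical hashes."""
    canonical = json.dumps(sanitized_context, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```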

Implementation Phases

Phase 1: Analysis Only (2 weeks) ✅ APPROVED

  • Build Context Gatherer
  • Implement LLM Analyzer with structured output
  • Create Slack/Linear integration for analysis reports
  • No auto-resolution, humans take action
  • Risk: Low - read-only analysis
  • Status: Approved by LLM Council

Phase 2: Safe Auto-Resolution (3 weeks) ⚠️ REQUIRES COUNCIL RE-REVIEW

Council Requirements: Must implement the following before Phase 2 deployment:

  • Implement policy-based scoring (replace raw LLM confidence thresholds)
  • Add precondition verification for all safe operations
  • Implement circuit breakers per operation type
  • Add post-action SLO verification with automatic rollback
  • Add pod restart, scale up (cache clear moved to Tier 3)
  • Add comprehensive audit logging
  • Create runbook for circuit breaker resets
  • Risk: Medium - limited blast radius with safety controls
  • Status: Blocked until safety controls implemented

Phase 3: Council-Reviewed Code Fixes (3 weeks)

  • Integrate LLM Council for fix review
  • Auto-create PRs with proposed fixes
  • Add feedback collection
  • Human approval required for merge
  • Risk: Medium - humans in the loop

Phase 4: Proactive Anomaly Detection (4 weeks)

  • Add ML-based anomaly detection
  • Trigger analysis before alerts fire
  • Learn baseline patterns per service
  • Reduce false positives over time
  • Risk: Medium - may generate noise initially

Cost Analysis

LLM API Costs (Estimated)

| Scenario | Incidents/Month | Tokens/Incident | Cost/Month |
|---|---|---|---|
| Low volume | 50 | 50,000 | ~$50 |
| Medium volume | 200 | 50,000 | ~$200 |
| High volume | 1,000 | 50,000 | ~$1,000 |

ROI Calculation

| Metric | Before AIOps | After AIOps | Savings |
|---|---|---|---|
| MTTR | 2 hours | 15 minutes | 87% |
| On-call pages/week | 20 | 5 | 75% |
| Engineer hours/incident | 3 | 0.5 | 83% |

Break-even: ~10 incidents/month at $100/engineer-hour
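
The savings percentages in the ROI table follow directly from the before/after columns (truncated to whole percent); a quick arithmetic check:

```python
def savings_pct(before: float, after: float) -> int:
    """Percent reduction, truncated to a whole percent as in the table."""
    return int(100 * (1 - after / before))
```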

Success Metrics

| Metric | Target | Measurement |
|---|---|---|
| Root cause accuracy | >80% | Human validation sampling |
| Auto-resolution success | >95% | Fix effectiveness tracking |
| MTTR reduction | >50% | Incident timeline analysis |
| False positive rate | <10% | Manual review of analyses |
| Engineer satisfaction | >4/5 | Quarterly survey |

Decision

Recommended: Option C (Hybrid Approach), implemented in phases starting with analysis-only.

Rationale

  1. Low risk start: Phase 1 is read-only, no production impact
  2. Incremental value: Each phase delivers measurable improvement
  3. Human oversight: Council review for high-stakes decisions
  4. Learning system: Feedback loop improves over time
  5. Cost-effective: Only runs when needed (alert-triggered)

Open Questions - Council Responses

These questions were reviewed by the LLM Council on 2026-01-20.

  1. Confidence thresholds: Are 0.9 (auto-apply) and 0.8 (council review) appropriate?

    Council Response: ❌ Raw LLM confidence scores are unreliable for automation decisions. Resolution: Replace with policy-based scoring using deterministic factors (incident type, fix type, recent failure history, system state).

  2. Safe operations list: Should any operations be added/removed from auto-apply?

    Council Response: ❌ Remove clear_cache - thundering herd risk makes it unsafe for auto-apply. Resolution: Cache clearing moved to Tier 3 (human required). Added precondition verification for remaining Tier 1 ops.

  3. Privacy concerns: Is the PII sanitization approach sufficient?

    Council Response: ✅ Acceptable for Phase 1. Consider additional patterns for API keys and OAuth tokens.

  4. Cost vs. value: Is the ROI analysis realistic?

    Council Response: ✅ ROI estimates are reasonable. Phase 1 (analysis-only) provides value with minimal cost.

  5. Failure modes: What happens if the LLM gives bad advice consistently?

    Council Response: ⚠️ Add circuit breakers to prevent cascading failures from bad recommendations. Resolution: Implemented circuit breaker pattern with failure thresholds and automatic disable.

References