ADR-013: AIOps - LLM-Powered Root Cause Analysis and Auto-Resolution
Status
Conditionally Approved: Phase 1 (Analysis Only) is approved; Phase 2+ requires addressing Council feedback.
LLM Council Review (2026-01-20)
Verdict: Conditional Approval (Phase 1 Only)
Key Findings:
- Raw LLM confidence scores are unreliable for automation decisions
- Cache clearing poses thundering herd risk - moved to Tier 3
- Pod restarts require precondition verification
- Post-action SLO verification is mandatory
Required Changes for Phase 2:
- Replace LLM confidence thresholds with policy-based scoring
- Add circuit breakers and remediation safety controller
- Implement precondition verification for all safe ops
- Add post-action SLO verification with automatic rollback
Date
2026-01-20
Context
Habit Hub uses Grafana Faro for Real User Monitoring (RUM) and OpenTelemetry for distributed tracing. Currently, when issues occur:
- Alerts fire based on static thresholds
- Engineers manually investigate traces, logs, and metrics
- Root cause analysis is time-consuming and requires domain expertise
- Resolution depends on engineer availability
Problem Statement: Can we leverage LLMs to automatically analyze observability data, identify root causes, and propose or execute fixes?
Current Observability Stack
```
Browser → Faro SDK → OTel Collector → Tempo        (traces)
                                    → Loki         (logs)
                                    → Prometheus   (metrics)
                                    → Grafana      (visualization)
                                    → AlertManager (alerting)
```
Opportunity
Modern LLMs can:
- Understand stack traces and error messages
- Correlate events across distributed systems
- Reason about code behavior given context
- Generate fixes based on patterns
Decision Drivers
- Mean Time to Resolution (MTTR): Reduce from hours to minutes
- On-Call Burden: Reduce toil for engineers
- Consistency: Apply same analysis rigor to all incidents
- Knowledge Capture: Encode institutional knowledge in prompts
- Safety: Prevent auto-resolution from causing more harm
- Cost: LLM API costs vs. engineer time savings
- Privacy: Ensure no PII leaks to LLM providers
Considered Options
Option A: Alert-Triggered Analysis (Reactive)
AlertManager → Webhook → AIOps Agent → LLM Analysis → Action
Flow:
- Grafana alert fires (error rate spike, latency threshold, etc.)
- Webhook triggers AIOps agent
- Agent gathers context (traces, logs, metrics, relevant code)
- LLM analyzes and produces root cause hypothesis
- Based on confidence, either:
  - Auto-apply a safe fix (config change, restart)
  - Create a PR for a code fix
  - Alert on-call with the analysis
Pros:
- Simple integration with existing alerting
- Only runs when needed (cost-efficient)
- Clear trigger → action model
Cons:
- Reactive, not predictive
- Alert fatigue can delay analysis
- Dependent on good alert configuration
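The alert-triggered entry point of Option A can be sketched as below. The payload shape follows Alertmanager's standard webhook format; the `InvestigationTrigger` fields are illustrative assumptions, not a fixed schema.

```python
from dataclasses import dataclass

@dataclass
class InvestigationTrigger:
    # Hypothetical trigger record handed to the AIOps agent
    alert_name: str
    severity: str
    fingerprint: str
    annotations: dict

def parse_alertmanager_webhook(payload: dict) -> list[InvestigationTrigger]:
    """Convert an Alertmanager webhook body into triggers, firing alerts only."""
    triggers = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts do not start an investigation
        labels = alert.get("labels", {})
        triggers.append(InvestigationTrigger(
            alert_name=labels.get("alertname", "unknown"),
            severity=labels.get("severity", "none"),
            fingerprint=alert.get("fingerprint", ""),
            annotations=alert.get("annotations", {}),
        ))
    return triggers
```

A webhook handler would call this on each POST from AlertManager and enqueue one investigation per trigger.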
Option B: Continuous Anomaly Detection (Proactive)
Telemetry Stream → ML Anomaly Detection → LLM Analysis → Action
Flow:
- ML model continuously analyzes telemetry streams
- Detects anomalies before they become incidents
- LLM investigates anomalies proactively
- Alerts or fixes before users are impacted
Pros:
- Catches issues before user impact
- Can detect subtle degradations
- Learns normal patterns over time
Cons:
- Higher compute/API costs (always running)
- More false positives initially
- Requires ML infrastructure (not just LLM)
Option C: Hybrid Approach (Recommended)
Combine reactive alerts with proactive pattern analysis:
```
                       AIOps Control Plane

 ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
 │   Reactive   │     │  Proactive   │     │  On-Demand   │
 │   (Alerts)   │     │  (Anomaly)   │     │   (Manual)   │
 └──────┬───────┘     └──────┬───────┘     └──────┬───────┘
        │                    │                    │
        └────────────────────┼────────────────────┘
                             ▼
                  ┌──────────────────┐
                  │ Context Gatherer │
                  │ - Traces (Tempo) │
                  │ - Logs (Loki)    │
                  │ - Metrics (Prom) │
                  │ - Code (GitHub)  │
                  │ - History (DB)   │
                  └────────┬─────────┘
                           ▼
                  ┌──────────────────┐
                  │   LLM Analyzer   │
                  │ - Root Cause     │
                  │ - Impact Scope   │
                  │ - Fix Proposal   │
                  │ - Confidence     │
                  └────────┬─────────┘
                           ▼
                  ┌──────────────────┐
                  │   LLM Council    │
                  │  (High Stakes)   │
                  └────────┬─────────┘
                           ▼
                  ┌──────────────────┐
                  │  Action Router   │
                  └────────┬─────────┘
                           │
         ┌─────────────────┼─────────────────┐
         ▼                 ▼                 ▼
 ┌────────────┐     ┌────────────┐     ┌────────────┐
 │ Auto-Apply │     │ Create PR  │     │   Alert    │
 │ (Safe Ops) │     │ (Code Fix) │     │  (Human)   │
 └────────────┘     └────────────┘     └────────────┘
```
Proposed Architecture
Component 1: Context Gatherer
Collects relevant data when an investigation is triggered:
```python
class ContextGatherer:
    async def gather(self, trigger: InvestigationTrigger) -> IncidentContext:
        return IncidentContext(
            # Observability data
            traces=await self.tempo.get_traces(trigger.trace_id, window="5m"),
            logs=await self.loki.query(trigger.log_query, window="10m"),
            metrics=await self.prometheus.query(trigger.metric_queries),
            # Code context
            stack_trace=trigger.stack_trace,
            relevant_code=await self.github.get_files(
                self.extract_files_from_trace(trigger.stack_trace)
            ),
            recent_commits=await self.github.get_commits(window="24h"),
            # Historical context
            similar_incidents=await self.incident_db.find_similar(trigger),
            runbooks=await self.runbook_db.find_relevant(trigger.error_type),
            # System state
            deployments=await self.argocd.get_recent_deployments(),
            config_changes=await self.config_db.get_recent_changes(),
        )
```
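The `extract_files_from_trace` helper referenced above could look like the following sketch for Python-style tracebacks. The stdlib-path filter is an assumption about what counts as application code.

```python
import re

def extract_files_from_trace(stack_trace: str) -> list[str]:
    """Return unique application file paths from 'File "...", line N' frames."""
    paths = re.findall(r'File "([^"]+)", line \d+', stack_trace)
    seen, unique = set(), []
    for path in paths:
        # Skip interpreter/stdlib frames; keep first occurrence of each file
        if path not in seen and not path.startswith("/usr/lib"):
            seen.add(path)
            unique.append(path)
    return unique
```

The resulting list feeds directly into `github.get_files(...)` in the gatherer.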
Component 2: LLM Analyzer
Analyzes context and produces structured output:
```python
class LLMAnalyzer:
    async def analyze(self, context: IncidentContext) -> Analysis:
        prompt = self.build_prompt(context)
        response = await self.llm.complete(
            model="claude-sonnet-4-20250514",  # Fast for initial analysis
            system=AIOPS_SYSTEM_PROMPT,
            messages=[{"role": "user", "content": prompt}],
            response_format=AnalysisSchema,  # Structured output
        )
        return Analysis(
            root_cause=response.root_cause,
            root_cause_confidence=response.confidence,
            impact_scope=response.impact_scope,
            affected_users_estimate=response.affected_users,
            proposed_fixes=response.fixes,
            fix_confidence=response.fix_confidence,
            requires_council_review=response.confidence < 0.8 or response.is_high_risk,
            supporting_evidence=response.evidence,
        )
```
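The `AnalysisSchema` passed as `response_format` could be a simple dataclass like the sketch below. Field names mirror the `Analysis` object above; the range check on `confidence` is an assumed validation rule, not a fixed contract.

```python
from dataclasses import dataclass, field

@dataclass
class AnalysisSchema:
    # Structured fields the model must fill in (names are illustrative)
    root_cause: str
    confidence: float
    impact_scope: str
    affected_users: int
    fixes: list = field(default_factory=list)
    fix_confidence: float = 0.0
    is_high_risk: bool = False
    evidence: list = field(default_factory=list)

    def __post_init__(self):
        # Reject out-of-range confidence values at parse time
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")
```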
Component 3: LLM Council Integration
For high-stakes decisions, use multi-model consensus:
```python
class AIOpsCouncil:
    async def review_fix(self, analysis: Analysis, fix: ProposedFix) -> CouncilVerdict:
        return await self.council.consult(
            query=f"""
## Incident Analysis
{analysis.to_markdown()}

## Proposed Fix
{fix.to_markdown()}

## Questions for Council
1. Is the root cause analysis correct?
2. Will the proposed fix resolve the issue?
3. What are the risks of applying this fix?
4. Should this be auto-applied or require human approval?
""",
            confidence="high",       # Full council deliberation
            verdict_type="binary",   # approve/reject
            include_dissent=True,
        )
```
Component 4: Action Router
Routes decisions based on policy scoring and deterministic verification (not raw LLM confidence):
```python
class ActionRouter:
    # Tier 1: Pre-approved safe operations (require precondition verification)
    SAFE_OPS = {
        "restart_pod": {
            "max_replicas": 1,
            "cooldown": "5m",
            "preconditions": ["not_during_deploy", "pod_healthy_recently", "no_pending_migrations"],
        },
        "scale_up": {
            "max_increase": 2,
            "max_total": 10,
            "preconditions": ["under_resource_quota", "no_pending_scale_ops"],
        },
        "toggle_feature_flag": {
            "flags": ["maintenance_mode"],
            "preconditions": ["flag_in_allowlist"],
        },
        "rollback_config": {
            "window": "1h",
            "preconditions": ["config_has_previous_version", "no_dependent_changes"],
        },
    }

    # Tier 3: Always require human approval (moved from Tier 1 per Council review)
    HUMAN_REQUIRED_OPS = {
        "clear_cache": {"reason": "Thundering herd risk - requires human judgment"},
        "database_migration": {"reason": "Data integrity risk"},
        "code_deployment": {"reason": "Broad impact"},
        "security_fix": {"reason": "Security-sensitive"},
        "data_modification": {"reason": "Data integrity risk"},
    }

    async def route(self, analysis: Analysis) -> ActionResult:
        for fix in analysis.proposed_fixes:
            # Calculate policy-based score (not raw LLM confidence)
            policy_score = await self.calculate_policy_score(analysis, fix)

            # Tier 1: Auto-apply safe operations with precondition verification
            if fix.type in self.SAFE_OPS:
                preconditions_met = await self.verify_preconditions(fix)
                if not preconditions_met:
                    return await self.alert_oncall(analysis, fix,
                                                   reason="Preconditions not met")
                if policy_score.auto_apply_eligible:
                    result = await self.auto_apply_with_circuit_breaker(fix)
                    # Post-action SLO verification with automatic rollback
                    await self.verify_slo_after_action(result, rollback_on_failure=True)
                    return result

            # Tier 2: Council review for code changes
            elif fix.type == "code_change":
                verdict = await self.council.review_fix(analysis, fix)
                if not verdict.approved:
                    return await self.alert_oncall(analysis, fix,
                                                   reason="Council rejected proposed fix")
                return await self.create_pr(fix, auto_merge=False,
                                            label="needs-human-review")

        # Tier 3: Alert humans for high-risk or unroutable operations
        return await self.alert_oncall(analysis, fix)

    async def calculate_policy_score(self, analysis: Analysis, fix: ProposedFix) -> PolicyScore:
        """
        Deterministic policy-based scoring instead of raw LLM confidence.
        Factors: incident type, fix type, recent failure history, system state.
        """
        return PolicyScore(
            auto_apply_eligible=(
                fix.type in self.SAFE_OPS
                and analysis.incident_type in ["high_latency", "error_spike", "resource_exhaustion"]
                and self.circuit_breaker.is_closed(fix.type)
                and await self.no_recent_failures(fix.type, window="1h")
            ),
            requires_council=fix.type == "code_change" or analysis.is_high_risk,
            requires_human=fix.type in self.HUMAN_REQUIRED_OPS,
        )

    async def auto_apply_with_circuit_breaker(self, fix: ProposedFix) -> ActionResult:
        """Apply fix with circuit breaker protection."""
        if self.circuit_breaker.is_open(fix.type):
            raise CircuitBreakerOpenError(f"Too many recent failures for {fix.type}")
        try:
            result = await self.auto_apply(fix)
            self.circuit_breaker.record_success(fix.type)
            return result
        except Exception:
            self.circuit_breaker.record_failure(fix.type)
            raise

    async def verify_slo_after_action(self, result: ActionResult, rollback_on_failure: bool):
        """Verify SLOs are met after the action; roll back if degraded."""
        await asyncio.sleep(30)  # Wait for metrics to stabilize
        slo_check = await self.prometheus.check_slos(result.affected_service)
        if not slo_check.passed and rollback_on_failure:
            await self.rollback(result)
            await self.alert_oncall(result.analysis, result.fix,
                                    reason="Auto-rollback: SLO degradation detected")
```
Component 5: Feedback Loop
Learn from outcomes to improve:
```python
class FeedbackCollector:
    async def record_outcome(self, incident_id: str, outcome: Outcome):
        await self.db.save({
            "incident_id": incident_id,
            "analysis": outcome.analysis,
            "action_taken": outcome.action,
            "resolution_time": outcome.resolution_time,
            "was_correct": outcome.human_feedback,  # Did the fix work?
            "false_positive": outcome.was_false_positive,
        })
        # Periodic refinement of prompts based on outcomes
        if await self.should_update_prompts():
            await self.update_prompts_from_feedback()
```
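The recorded outcomes can be rolled up into the accuracy and false-positive figures tracked in the Success Metrics section. A minimal sketch, assuming outcome records are dicts shaped like the ones `record_outcome` saves:

```python
def summarize_outcomes(outcomes: list[dict]) -> dict:
    """Compute root-cause accuracy (over human-reviewed incidents) and
    false-positive rate (over all incidents) from feedback records."""
    reviewed = [o for o in outcomes if o.get("was_correct") is not None]
    correct = sum(1 for o in reviewed if o["was_correct"])
    false_pos = sum(1 for o in outcomes if o.get("false_positive"))
    return {
        "root_cause_accuracy": correct / len(reviewed) if reviewed else None,
        "false_positive_rate": false_pos / len(outcomes) if outcomes else None,
    }
```

Incidents without human feedback are excluded from the accuracy denominator so unreviewed cases do not skew the metric.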
Auto-Resolution Safety Matrix
Updated per LLM Council Review: Cache clearing moved to Tier 3 due to thundering herd risk. All Tier 1 operations now require precondition verification and post-action SLO checks.
| Fix Type | Tier | Auto-Apply? | Preconditions | Rollback | SLO Check |
|---|---|---|---|---|---|
| Pod restart | 1 | Yes | Not during deploy, pod healthy recently, no migrations | N/A | Yes |
| Scale up | 1 | Yes | Under quota, no pending scale ops | Scale down | Yes |
| Config rollback | 1 | Yes | Previous version exists, no dependent changes | Re-apply | Yes |
| Feature flag off | 1 | Yes | Flag in allowlist | Re-enable | Yes |
| Cache clear | 3 | No | N/A - Requires human approval (thundering herd risk) | N/A | N/A |
| Code changes | 2 | No | Council review, PR created | Revert PR | N/A |
| Database migration | 3 | No | Always human approval | Manual | N/A |
| Security fix | 3 | No | Alert only, never auto-fix | N/A | N/A |
| Data modification | 3 | No | Never auto-apply | N/A | N/A |
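Precondition verification for the Tier 1 rows above could be wired as a name-to-check registry. The individual checks here are placeholders keyed by the precondition names in the SAFE_OPS table; unknown names fail closed.

```python
import asyncio

# Placeholder checks keyed by the precondition names used in SAFE_OPS.
# Each takes a system-state snapshot dict (shape is an assumption).
PRECONDITION_CHECKS = {
    "not_during_deploy": lambda ctx: not ctx.get("deploy_in_progress", False),
    "pod_healthy_recently": lambda ctx: ctx.get("healthy_minutes", 0) >= 10,
    "no_pending_migrations": lambda ctx: not ctx.get("pending_migrations"),
    "flag_in_allowlist": lambda ctx: ctx.get("flag") in ctx.get("allowlist", ()),
}

async def verify_preconditions(names: list[str], ctx: dict) -> tuple[bool, list[str]]:
    """Return (all_passed, failed_names); unknown preconditions fail closed."""
    failed = []
    for name in names:
        check = PRECONDITION_CHECKS.get(name)
        if check is None or not check(ctx):
            failed.append(name)
    return (not failed, failed)
```

Returning the failed names lets the on-call alert say exactly which guard blocked the auto-apply.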
Circuit Breaker Configuration
```python
CIRCUIT_BREAKER_CONFIG = {
    "restart_pod": {"failure_threshold": 3, "reset_timeout": "30m"},
    "scale_up": {"failure_threshold": 2, "reset_timeout": "1h"},
    "rollback_config": {"failure_threshold": 2, "reset_timeout": "1h"},
    "toggle_feature_flag": {"failure_threshold": 3, "reset_timeout": "30m"},
}
```
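A minimal in-memory breaker matching this config might look as follows: it opens after `failure_threshold` consecutive failures and allows a retry once the reset timeout has elapsed. Parsing duration strings like "30m" into seconds and persisting state across restarts are left out of this sketch.

```python
import time

class CircuitBreaker:
    """Per-operation breaker: open after N consecutive failures,
    half-open (one retry allowed) after the reset timeout."""

    def __init__(self, failure_threshold: int, reset_timeout_s: float):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at >= self.reset_timeout_s:
            # Timeout elapsed: move to half-open and permit one attempt
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def is_closed(self) -> bool:
        return not self.is_open()

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

The `ActionRouter` holds one instance per operation type, constructed from `CIRCUIT_BREAKER_CONFIG`.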
Privacy & Security Considerations
Data Sent to LLM
```python
class PIISanitizer:
    def sanitize(self, context: IncidentContext) -> SanitizedContext:
        return SanitizedContext(
            # Sanitize logs
            logs=self.redact_pii(context.logs),
            # Hash user identifiers
            user_ids=self.hash_ids(context.affected_users),
            # Remove secrets from stack traces
            stack_trace=self.redact_secrets(context.stack_trace),
            # Code is OK (already public or internal)
            code=context.relevant_code,
        )

PII_PATTERNS = [
    r'\b[\w.-]+@[\w.-]+\.\w+\b',   # Emails
    r'\b\d{3}-\d{2}-\d{4}\b',      # SSNs
    r'Bearer\s+[A-Za-z0-9-_]+',    # Bearer tokens
    r'password["\s:=]+\S+',        # Passwords
]
```
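A hypothetical `redact_pii` applying those patterns could replace each match with a typed placeholder, so analysts can still see what kind of value was removed. The labels and placeholder format are assumptions.

```python
import re

# Same regexes as PII_PATTERNS above, paired with placeholder labels
LABELED_PII_PATTERNS = [
    ("EMAIL", r'\b[\w.-]+@[\w.-]+\.\w+\b'),
    ("SSN", r'\b\d{3}-\d{2}-\d{4}\b'),
    ("TOKEN", r'Bearer\s+[A-Za-z0-9-_]+'),
    ("PASSWORD", r'password["\s:=]+\S+'),
]

def redact_pii(text: str) -> str:
    """Replace every PII match with a typed [REDACTED_*] placeholder."""
    for label, pattern in LABELED_PII_PATTERNS:
        text = re.sub(pattern, f"[REDACTED_{label}]", text)
    return text
```

Redaction must run before any context leaves the cluster for an LLM provider.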
Audit Trail
All AIOps actions are logged:
```python
@dataclass
class AIOpsAuditLog:
    timestamp: datetime
    trigger_type: str       # alert, anomaly, manual
    trigger_id: str
    context_hash: str       # Hash of data sent to LLM
    analysis_id: str
    llm_model: str
    council_verdict: Optional[CouncilVerdict]
    action_taken: str
    action_result: str
    human_override: Optional[str]
```
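One way the `context_hash` field could be computed is a SHA-256 over the sanitized context serialized as canonical JSON, so identical inputs always produce the same hash regardless of key order. The field selection is an assumption of this sketch.

```python
import hashlib
import json

def context_hash(sanitized_context: dict) -> str:
    """Stable SHA-256 hex digest of the sanitized context sent to the LLM."""
    canonical = json.dumps(sanitized_context, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Hashing (rather than storing) the context keeps the audit trail verifiable without retaining potentially sensitive payloads.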
Implementation Phases
Phase 1: Analysis Only (2 weeks) ✅ APPROVED
- Build Context Gatherer
- Implement LLM Analyzer with structured output
- Create Slack/Linear integration for analysis reports
- No auto-resolution, humans take action
- Risk: Low - read-only analysis
- Status: Approved by LLM Council
Phase 2: Safe Auto-Resolution (3 weeks) ⚠️ REQUIRES COUNCIL RE-REVIEW
Council Requirements: Must implement the following before Phase 2 deployment:
- Implement policy-based scoring (replace raw LLM confidence thresholds)
- Add precondition verification for all safe operations
- Implement circuit breakers per operation type
- Add post-action SLO verification with automatic rollback
- Enable pod restart and scale-up remediations (cache clear moved to Tier 3)
- Add comprehensive audit logging
- Create runbook for circuit breaker resets
- Risk: Medium - limited blast radius with safety controls
- Status: Blocked until safety controls implemented
Phase 3: Council-Reviewed Code Fixes (3 weeks)
- Integrate LLM Council for fix review
- Auto-create PRs with proposed fixes
- Add feedback collection
- Human approval required for merge
- Risk: Medium - humans in the loop
Phase 4: Proactive Anomaly Detection (4 weeks)
- Add ML-based anomaly detection
- Trigger analysis before alerts fire
- Learn baseline patterns per service
- Reduce false positives over time
- Risk: Medium - may generate noise initially
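As a toy illustration of the Phase 4 trigger shape (not the ML model itself), a baseline z-score check flags a sample that deviates too far from recent history. The threshold and window are assumptions.

```python
import statistics

def is_anomalous(baseline: list[float], sample: float, z_threshold: float = 3.0) -> bool:
    """Flag `sample` if it sits more than z_threshold standard deviations
    from the mean of the recent baseline window."""
    if len(baseline) < 2:
        return False  # not enough history to judge
    mean = statistics.fmean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        return sample != mean  # flat baseline: any deviation is anomalous
    return abs(sample - mean) / stdev > z_threshold
```

A real deployment would learn per-service baselines with seasonality handling; this only shows how an anomaly signal would feed the LLM investigation pipeline.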
Cost Analysis
LLM API Costs (Estimated)
| Scenario | Incidents/Month | Tokens/Incident | Cost/Month |
|---|---|---|---|
| Low volume | 50 | 50,000 | ~$50 |
| Medium volume | 200 | 50,000 | ~$200 |
| High volume | 1,000 | 50,000 | ~$1,000 |
ROI Calculation
| Metric | Before AIOps | After AIOps | Savings |
|---|---|---|---|
| MTTR | 2 hours | 15 minutes | 87% |
| On-call pages/week | 20 | 5 | 75% |
| Engineer hours/incident | 3 | 0.5 | 83% |
Break-even: ~10 incidents/month at $100/engineer-hour
Success Metrics
| Metric | Target | Measurement |
|---|---|---|
| Root cause accuracy | >80% | Human validation sampling |
| Auto-resolution success | >95% | Fix effectiveness tracking |
| MTTR reduction | >50% | Incident timeline analysis |
| False positive rate | <10% | Manual review of analyses |
| Engineer satisfaction | >4/5 | Quarterly survey |
Decision
Recommended: Option C (Hybrid Approach), implemented in phases starting with analysis-only.
Rationale
- Low risk start: Phase 1 is read-only, no production impact
- Incremental value: Each phase delivers measurable improvement
- Human oversight: Council review for high-stakes decisions
- Learning system: Feedback loop improves over time
- Cost-effective: Only runs when needed (alert-triggered)
Open Questions - Council Responses
These questions were reviewed by the LLM Council on 2026-01-20.
1. Confidence thresholds: Are 0.9 (auto-apply) and 0.8 (council review) appropriate?
   Council Response: ❌ Raw LLM confidence scores are unreliable for automation decisions. Resolution: Replace with policy-based scoring using deterministic factors (incident type, fix type, recent failure history, system state).
2. Safe operations list: Should any operations be added/removed from auto-apply?
   Council Response: ❌ Remove `clear_cache` - thundering herd risk makes it unsafe for auto-apply. Resolution: Cache clearing moved to Tier 3 (human required); precondition verification added for the remaining Tier 1 ops.
3. Privacy concerns: Is the PII sanitization approach sufficient?
   Council Response: ✅ Acceptable for Phase 1. Consider additional patterns for API keys and OAuth tokens.
4. Cost vs. value: Is the ROI analysis realistic?
   Council Response: ✅ ROI estimates are reasonable. Phase 1 (analysis-only) provides value with minimal cost.
5. Failure modes: What happens if the LLM gives bad advice consistently?
   Council Response: ⚠️ Add circuit breakers to prevent cascading failures from bad recommendations. Resolution: Circuit breaker pattern implemented with failure thresholds and automatic disable.
References
- Grafana Faro Documentation
- OpenTelemetry Specification
- Google SRE Book - Incident Response
- ADR-012: Security Scanning Strategy
- OBSERVABILITY_IMPLEMENTATION.md