# Roadmap: Automating Incident Communication with LLMs
This post outlines planned automation features for Stentorosaur—specifically, using LLMs to draft incident summaries and statistical methods to detect degradation. These are design proposals, not shipped features.
## Scope and Constraints
Before diving in, let's be clear about what Stentorosaur can and cannot access:
**What we have access to:**
- GitHub Issues (incident text, comments, labels, timestamps)
- Committed JSON data (response times, uptime history)
- GitHub Actions environment (CI context, environment variables)
**What we do NOT have access to:**
- Your server logs (unless you explicitly pipe them somewhere)
- APM tools (Datadog, New Relic, etc.)
- Internal metrics (CPU, memory, database queries)
All automation features work within these boundaries. We're not building a log aggregation platform—we're automating the communication layer on top of your existing status workflow.
## Feature 1: Incident Summary Drafts

**The problem:** During incidents, engineers update GitHub Issues with terse, inconsistent notes:

```
Title: API down
Body: 503s. investigating.
```
Hours later, someone needs to write a proper post-mortem. They dig through issue comments, Slack threads, and deploy logs to reconstruct the timeline.
**The solution:** A GitHub Action that parses the issue timeline and drafts a structured summary.

### How it works

```
GitHub Issue (opened/labeled)
  → GitHub Action triggered
  → Parse issue body + comments
  → Send to LLM (OpenAI/Anthropic/Ollama)
  → LLM returns structured summary
  → Action posts summary as PR or issue comment
  → Human reviews, edits, merges
```
**Input:** GitHub Issue text and comments only. No server logs, no metrics integration.

**Output:** A draft summary in this format:

```markdown
## Incident Summary (Draft)

**Impact:** Authentication endpoints returning 503 errors
**Duration:** 14:32 - 15:10 UTC (38 minutes)
**Affected Systems:** API Gateway, OAuth service

**Timeline:**
- 14:32: First report in issue
- 14:45: "Scaled replicas" mentioned in comments
- 15:10: Issue closed

**Resolution:** Horizontal scaling applied

---
*Generated from GitHub Issue #123. Review before publishing.*
```
### Implementation

```yaml
# .github/workflows/incident-summary.yml
name: Draft Incident Summary

on:
  issues:
    types: [closed]

jobs:
  summarize:
    if: contains(github.event.issue.labels.*.name, 'status')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Generate summary
        run: |
          npx stentorosaur-summarize \
            --issue ${{ github.event.issue.number }} \
            --model ollama/llama3  # Or: openai/gpt-4
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}  # Optional
```
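The workflow above hands everything to `npx stentorosaur-summarize`. As a rough illustration of what that step could do, here is a sketch under assumptions: `draftSummary`, the `complete` helper, the prompt wording, and the Ollama endpoint usage are all illustrative, not the actual CLI internals.

```typescript
// Illustrative sketch only; not the shipped stentorosaur-summarize internals.
import { Octokit } from "@octokit/rest";

// Local inference via Ollama's HTTP API (assumes `ollama serve` is running on the runner).
async function complete(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "llama3", prompt, stream: false }),
  });
  const { response } = await res.json();
  return response;
}

async function draftSummary(owner: string, repo: string, issueNumber: number): Promise<string> {
  const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

  // The only inputs available: the issue body, its comments, and their timestamps.
  const { data: issue } = await octokit.rest.issues.get({ owner, repo, issue_number: issueNumber });
  const { data: comments } = await octokit.rest.issues.listComments({
    owner,
    repo,
    issue_number: issueNumber, // first page only in this sketch
  });

  const timeline = [
    `TITLE: ${issue.title}`,
    `OPENED ${issue.created_at}: ${issue.body ?? ""}`,
    ...comments.map((c) => `${c.created_at} (${c.user?.login ?? "unknown"}): ${c.body ?? ""}`),
    `CLOSED: ${issue.closed_at ?? "still open"}`,
  ].join("\n");

  return complete(
    "Draft an incident summary with Impact, Duration, Affected Systems, Timeline, and Resolution " +
      "based only on this GitHub issue thread:\n\n" + timeline
  );
}
```

The Action would then post the returned draft as an issue comment or PR for a human to review.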
### Configuration

```json
{
  "summarize": {
    "enabled": true,
    "provider": "ollama",
    "model": "llama3",
    "output": "comment",
    "template": "root-cause-impact-resolution"
  }
}
```

**Provider options:**

- `ollama` — Local inference, no data leaves your CI runner
- `openai` — OpenAI API (requires `OPENAI_API_KEY`)
- `anthropic` — Claude API (requires `ANTHROPIC_API_KEY`)
### Limitations
- LLMs hallucinate. The output is a draft, not a source of truth.
- Context window limits. Very long issue threads may be truncated.
- No external data. If root cause details aren't in the issue, the LLM can't infer them.
## Feature 2: Statistical Degradation Detection

**Note:** This is not "AI" in the neural network sense. It's basic statistics applied to your uptime data.

**The problem:** Traditional monitoring alerts on binary up/down. Gradual degradation (response times creeping from 100ms to 500ms) goes unnoticed until users complain.

**The solution:** Apply statistical thresholds to your existing response time data.

### How it works
```typescript
// Flag the latest sample when it sits more than 3σ above the 7-day mean.
type ResponseTime = number; // milliseconds
type Alert = { severity: 'minor'; message: string };

const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

const standardDeviation = (xs: number[]): number => {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) ** 2)));
};

function detectDegradation(history: ResponseTime[]): Alert | null {
  const window = history.slice(-7 * 24); // last 7 days of hourly samples
  const baseline = mean(window);
  const stdDev = standardDeviation(window);
  const current = history[history.length - 1];

  if (current > baseline + 3 * stdDev) {
    return {
      severity: 'minor',
      message: `Response time ${current}ms exceeds 3σ threshold (baseline: ${Math.round(baseline)}ms)`,
    };
  }
  return null;
}
```
This runs during your regular monitoring workflow. If degradation is detected, it creates a minor-severity issue.
### Configuration

```json
{
  "degradationDetection": {
    "enabled": true,
    "threshold": "3-sigma",
    "lookbackDays": 7,
    "autoCreateIssue": true,
    "minDataPoints": 100
  }
}
```
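To show how these fields might gate the check, here is an illustrative sketch that wires `detectDegradation` from the snippet above into a monitoring run and opens an issue when the threshold is crossed. The config file path, the `checkAndReport` name, and the label are assumptions for the example, not the final integration.

```typescript
// Illustrative wiring only; config path and label are assumptions.
import { readFileSync } from "node:fs";
import { Octokit } from "@octokit/rest";

interface DegradationConfig {
  enabled: boolean;
  lookbackDays: number; // the sketch above hard-codes a 7-day window
  autoCreateIssue: boolean;
  minDataPoints: number;
}

async function checkAndReport(history: ResponseTime[], owner: string, repo: string): Promise<void> {
  // Hypothetical config location for this sketch.
  const cfg: DegradationConfig =
    JSON.parse(readFileSync("stentorosaur.config.json", "utf8")).degradationDetection;

  // Skip until there is enough committed history to form a stable baseline.
  if (!cfg.enabled || history.length < cfg.minDataPoints) return;

  const alert = detectDegradation(history);
  if (alert && cfg.autoCreateIssue) {
    const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
    await octokit.rest.issues.create({
      owner,
      repo,
      title: "Degradation detected: response time above 3σ baseline",
      body: alert.message,
      labels: ["status"], // reuses the label the summary workflow filters on
    });
  }
}
```

Requiring `minDataPoints` before alerting keeps the check quiet while the committed history is still too small for a meaningful baseline.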
### What this is NOT
- Not machine learning
- Not anomaly detection in the "AI" sense
- Not predictive
It's a simple statistical baseline. If the current response time exceeds 3 standard deviations above the 7-day mean, we flag it. This is intentionally boring and predictable.
## What We're NOT Building
To be explicit, these features are out of scope:
| Feature | Why Not |
|---|---|
| Log ingestion | We're a status page plugin, not Splunk |
| Predictive maintenance | Requires internal telemetry we can't access |
| Auto-publishing updates | Too risky; humans must review |
| APM integration | Scope creep; use dedicated tools |
## Human-in-the-Loop
All automation features follow this principle:
- AI generates drafts, not final content
- Output is a PR or comment, not a published update
- Human reviews and merges
- AI-generated content is labeled in the UI
We will never auto-publish AI-generated content to your status page.
## Privacy

**Local-first:** Ollama support means incident data stays in your CI runner. Nothing leaves your infrastructure.

**API option:** If you prefer GPT-4/Claude quality, you provide the API key. We don't proxy through our servers.

**No telemetry:** Stentorosaur doesn't phone home with your incident data.
## Timeline
| Phase | Target | Scope |
|---|---|---|
| Phase 1 | Q1 2025 | Incident summary drafts (GitHub Issues → LLM → comment/PR) |
| Phase 2 | Q2 2025 | Statistical degradation detection (3-sigma thresholds) |
| Future | TBD | Natural language search over incident history |
These timelines assume the features are scoped as described. If scope expands, timelines slip.
## Feedback
This is a design proposal. Before building, we want to validate:
- Is incident summarization useful to your workflow?
- Would you use local inference (Ollama) or API-based?
- Are there privacy concerns we haven't addressed?
Discussion: GitHub Discussions

