# Roadmap: Automating Incident Communication with LLMs
This post outlines planned automation features for Stentorosaur—specifically, using LLMs to draft incident summaries and statistical methods to detect degradation. These are design proposals, not shipped features.
## Scope and Constraints
Before diving in, let's be clear about what Stentorosaur can and cannot access:
**What we have access to:**
- GitHub Issues (incident text, comments, labels, timestamps)
- Committed JSON data (response times, uptime history)
- GitHub Actions environment (CI context, environment variables)
**What we do NOT have access to:**
- Your server logs (unless you explicitly pipe them somewhere)
- APM tools (Datadog, New Relic, etc.)
- Internal metrics (CPU, memory, database queries)
All automation features work within these boundaries. We're not building a log aggregation platform—we're automating the communication layer on top of your existing status workflow.
## Feature 1: Incident Summary Drafts

**The problem:** During incidents, engineers update GitHub Issues with terse, inconsistent notes:

```
Title: API down
Body: 503s. investigating.
```
Hours later, someone needs to write a proper post-mortem. They dig through issue comments, Slack threads, and deploy logs to reconstruct the timeline.
**The solution:** A GitHub Action that parses the issue timeline and drafts a structured summary.

### How it works

```
GitHub Issue (opened/labeled)
  → GitHub Action triggered
  → Parse issue body + comments
  → Send to LLM (OpenAI/Anthropic/Ollama)
  → LLM returns structured summary
  → Action posts summary as PR or issue comment
  → Human reviews, edits, merges
```
**Input:** GitHub Issue text and comments only. No server logs, no metrics integration.

**Output:** A draft summary in this format:

```markdown
## Incident Summary (Draft)

**Impact:** Authentication endpoints returning 503 errors
**Duration:** 14:32 - 15:10 UTC (38 minutes)
**Affected Systems:** API Gateway, OAuth service

**Timeline:**
- 14:32: First report in issue
- 14:45: "Scaled replicas" mentioned in comments
- 15:10: Issue closed

**Resolution:** Horizontal scaling applied

---
*Generated from GitHub Issue #123. Review before publishing.*
```
### Implementation

```yaml
# .github/workflows/incident-summary.yml
name: Draft Incident Summary

on:
  issues:
    types: [closed]

jobs:
  summarize:
    if: contains(github.event.issue.labels.*.name, 'status')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Generate summary
        run: |
          npx stentorosaur-summarize \
            --issue ${{ github.event.issue.number }} \
            --model ollama/llama3  # Or: openai/gpt-4
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}  # Optional
```
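The workflow above hands everything to `npx stentorosaur-summarize`. As a rough illustration of what that step could do, here is a sketch under assumptions: `draftSummary`, the `complete` helper, the prompt wording, and the Ollama endpoint usage are all illustrative, not the actual CLI internals.

```typescript
// Illustrative sketch only; not the shipped stentorosaur-summarize internals.
import { Octokit } from "@octokit/rest";

// Local inference via Ollama's HTTP API (assumes `ollama serve` is running on the runner).
async function complete(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "llama3", prompt, stream: false }),
  });
  const { response } = await res.json();
  return response;
}

async function draftSummary(owner: string, repo: string, issueNumber: number): Promise<string> {
  const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

  // The only inputs available: the issue body, its comments, and their timestamps.
  const { data: issue } = await octokit.rest.issues.get({ owner, repo, issue_number: issueNumber });
  const { data: comments } = await octokit.rest.issues.listComments({
    owner,
    repo,
    issue_number: issueNumber, // first page only in this sketch
  });

  const timeline = [
    `TITLE: ${issue.title}`,
    `OPENED ${issue.created_at}: ${issue.body ?? ""}`,
    ...comments.map((c) => `${c.created_at} (${c.user?.login ?? "unknown"}): ${c.body ?? ""}`),
    `CLOSED: ${issue.closed_at ?? "still open"}`,
  ].join("\n");

  return complete(
    "Draft an incident summary with Impact, Duration, Affected Systems, Timeline, and Resolution " +
      "based only on this GitHub issue thread:\n\n" + timeline
  );
}
```

The Action would then post the returned draft as an issue comment or PR for a human to review.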
### Configuration

```json
{
  "summarize": {
    "enabled": true,
    "provider": "ollama",
    "model": "llama3",
    "output": "comment",
    "template": "root-cause-impact-resolution"
  }
}
```

**Provider options:**

- `ollama` — Local inference, no data leaves your CI runner
- `openai` — OpenAI API (requires `OPENAI_API_KEY`)
- `anthropic` — Claude API (requires `ANTHROPIC_API_KEY`)
### Limitations
- LLMs hallucinate. The output is a draft, not a source of truth.
- Context window limits. Very long issue threads may be truncated.
- No external data. If root cause details aren't in the issue, the LLM can't infer them.
## Feature 2: Statistical Degradation Detection

**Note:** This is not "AI" in the neural network sense. It's basic statistics applied to your uptime data.

**The problem:** Traditional monitoring alerts on binary up/down. Gradual degradation (response times creeping from 100ms to 500ms) goes unnoticed until users complain.

**The solution:** Apply statistical thresholds to your existing response time data.

### How it works
```typescript
// Flag the latest sample when it sits more than 3σ above the 7-day mean.
type ResponseTime = number; // milliseconds
type Alert = { severity: 'minor'; message: string };

const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

const standardDeviation = (xs: number[]): number => {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) ** 2)));
};

function detectDegradation(history: ResponseTime[]): Alert | null {
  const window = history.slice(-7 * 24); // last 7 days of hourly samples
  const baseline = mean(window);
  const stdDev = standardDeviation(window);
  const current = history[history.length - 1];

  if (current > baseline + 3 * stdDev) {
    return {
      severity: 'minor',
      message: `Response time ${current}ms exceeds 3σ threshold (baseline: ${Math.round(baseline)}ms)`,
    };
  }
  return null;
}
```
This runs during your regular monitoring workflow. If degradation is detected, it creates a minor-severity issue.
### Configuration

```json
{
  "degradationDetection": {
    "enabled": true,
    "threshold": "3-sigma",
    "lookbackDays": 7,
    "autoCreateIssue": true,
    "minDataPoints": 100
  }
}
```
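To show how these fields might gate the check, here is an illustrative sketch that wires `detectDegradation` from the snippet above into a monitoring run and opens an issue when the threshold is crossed. The config file path, the `checkAndReport` name, and the label are assumptions for the example, not the final integration.

```typescript
// Illustrative wiring only; config path and label are assumptions.
import { readFileSync } from "node:fs";
import { Octokit } from "@octokit/rest";

interface DegradationConfig {
  enabled: boolean;
  lookbackDays: number; // the sketch above hard-codes a 7-day window
  autoCreateIssue: boolean;
  minDataPoints: number;
}

async function checkAndReport(history: ResponseTime[], owner: string, repo: string): Promise<void> {
  // Hypothetical config location for this sketch.
  const cfg: DegradationConfig =
    JSON.parse(readFileSync("stentorosaur.config.json", "utf8")).degradationDetection;

  // Skip until there is enough committed history to form a stable baseline.
  if (!cfg.enabled || history.length < cfg.minDataPoints) return;

  const alert = detectDegradation(history);
  if (alert && cfg.autoCreateIssue) {
    const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
    await octokit.rest.issues.create({
      owner,
      repo,
      title: "Degradation detected: response time above 3σ baseline",
      body: alert.message,
      labels: ["status"], // reuses the label the summary workflow filters on
    });
  }
}
```

Requiring `minDataPoints` before alerting keeps the check quiet while the committed history is still too small for a meaningful baseline.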
### What this is NOT
- Not machine learning
- Not anomaly detection in the "AI" sense
- Not predictive
It's a simple statistical baseline. If the current response time exceeds 3 standard deviations above the 7-day mean, we flag it. This is intentionally boring and predictable.
## What We're NOT Building
To be explicit, these features are out of scope:
| Feature | Why Not |
|---|---|
| Log ingestion | We're a status page plugin, not Splunk |
| Predictive maintenance | Requires internal telemetry we can't access |
| Auto-publishing updates | Too risky; humans must review |
| APM integration | Scope creep; use dedicated tools |
## Human-in-the-Loop
All automation features follow this principle:
- AI generates drafts, not final content
- Output is a PR or comment, not a published update
- Human reviews and merges
- AI-generated content is labeled in the UI
We will never auto-publish AI-generated content to your status page.
## Privacy

**Local-first:** Ollama support means incident data stays in your CI runner. Nothing leaves your infrastructure.

**API option:** If you prefer GPT-4/Claude quality, you provide the API key. We don't proxy through our servers.

**No telemetry:** Stentorosaur doesn't phone home with your incident data.
## Timeline
| Phase | Target | Scope |
|---|---|---|
| Phase 1 | Q1 2025 | Incident summary drafts (GitHub Issues → LLM → comment/PR) |
| Phase 2 | Q2 2025 | Statistical degradation detection (3-sigma thresholds) |
| Future | TBD | Natural language search over incident history |
These timelines assume the features are scoped as described. If scope expands, timelines slip.
## Feedback
This is a design proposal. Before building, we want to validate:
- Is incident summarization useful to your workflow?
- Would you use local inference (Ollama) or API-based?
- Are there privacy concerns we haven't addressed?
Discussion: GitHub Discussions

