
Solving the 84% Empty Heatmap: Historical Data Aggregation in Stentorosaur

8 min read · Chris (Amiable Dev) · Claude (AI Assistant)

Our 90-day status heatmap looked great in mockups. In production, it was 84% empty. Here's how we fixed it with daily summary aggregation.

The Problem: 84% Empty

Stentorosaur's status page includes a 90-day heatmap showing system health over time. Each cell represents a day, colored by uptime percentage. The component rendered beautifully—except it only showed 14 days of actual data.

Day 1          Day 14                                    Day 90
[██████████████][░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░]
 ↑ Actual data   ↑ Empty (no data)
   (16%)           (84%)

Root cause: The heatmap read from current.json, which contains a rolling 14-day window. Individual health check readings aren't aggregated—they're just raw timestamps and latencies.

Loading 90 days of raw JSONL archives on every page load wasn't an option:

  • 90 days × 144 checks/day × 5 systems = 64,800 entries
  • ~2.5MB uncompressed JSON
  • 3-5 second load times on mobile

We needed aggregated daily summaries.

Solution Landscape

We evaluated four approaches:

Option | Description | Trade-offs
------ | ----------- | ----------
A. Expand current.json | Keep 90 days in rolling window | File grows to 2MB+, slow loads
B. Build-time aggregation | Aggregate in Docusaurus plugin | Works, but requires rebuild for updates
C. Client-side aggregation | Fetch archives, aggregate in browser | 2MB+ download, 3-5s compute
D. Daily summary file | Pre-aggregate to daily-summary.json | ~15KB file, fast loads

We chose Option D: generate a daily-summary.json file during each monitoring run.

The Daily Summary Schema

{
  "version": 1,
  "lastUpdated": "2026-01-01T12:00:00Z",
  "windowDays": 90,
  "services": {
    "api": [
      {
        "date": "2025-12-31",
        "uptimePct": 0.993,
        "avgLatencyMs": 145,
        "p95LatencyMs": 320,
        "checksTotal": 144,
        "checksPassed": 143,
        "incidentCount": 1
      }
    ],
    "website": [...]
  }
}

Each entry is ~100 bytes. 90 days × 5 services = ~45KB. Compression brings it under 15KB.

Key design decisions:

  1. P95 latency reveals consistent slowness: Averages mask tail latency. If 10% of requests are slow (14/144 checks), P95 surfaces that degradation while averages smooth it over. For catching rare single-check failures, we also track Max latency in the raw data.

  2. Incident count per day: Counting up→down transitions surfaces flapping issues that uptime percentage masks. A service with 99% uptime but 10 incidents is worse than one with 99% uptime and 1 long outage.

  3. Schema versioning: The version field enables future migrations without breaking old clients.

  4. UTC for all dates: All date keys use UTC to avoid timezone boundary confusion. A user in Tokyo and a server in Oregon both reference the same "2025-12-31" day.
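To make the P95-vs-average trade-off concrete, here is a small sketch in TypeScript. The latency values are hypothetical, and `p95` uses the same index formula (`ceil(n * 0.95) - 1` on a sorted array) as the aggregation code later in this post:

```typescript
// Sketch: averages hide tail latency that P95 exposes.

function p95(sortedLatencies: number[]): number {
  // 95th percentile of an ascending-sorted array
  return sortedLatencies[Math.ceil(sortedLatencies.length * 0.95) - 1];
}

function avg(latencies: number[]): number {
  return Math.round(latencies.reduce((sum, v) => sum + v, 0) / latencies.length);
}

// 130 fast checks at 100ms, 14 slow checks at 2000ms (~10% of the day slow)
const latencies = [...Array(130).fill(100), ...Array(14).fill(2000)].sort((a, b) => a - b);

console.log(avg(latencies)); // 285 — looks acceptable
console.log(p95(latencies)); // 2000 — surfaces the degradation
```

A day where roughly one check in ten takes 2000ms still averages under 300ms; only the percentile makes the regression visible.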

The Stale Today Problem

Here's a subtle bug we almost shipped:

Daily Summary (generated at midnight UTC):
2025-12-31: uptimePct=0.993

Current time: 2025-12-31 at 6pm UTC
Today's reality: 3 more outages since summary was generated

If we only read daily-summary.json, today's data is stale. The solution: hybrid read pattern.

Hybrid Read Pattern

The useDailySummary hook fetches both files in parallel:

export function useDailySummary(options: UseDailySummaryOptions): UseDailySummaryResult {
  const { baseUrl, serviceName, days = 90 } = options;

  useEffect(() => {
    // Effect callbacks can't be async, so wrap the awaits in an inner function
    const load = async () => {
      // Fetch both files in parallel
      const [summaryResponse, currentResponse] = await Promise.all([
        fetch(`${baseUrl}/daily-summary.json`).catch(() => null),
        fetch(`${baseUrl}/current.json`).catch(() => null),
      ]);

      // Handle responses...
    };
    load();
  }, [baseUrl]);

  // Merge: today from current.json, history from summary
  const mergedData = useMemo(() => {
    const today = new Date().toISOString().split('T')[0];
    const entries: DailySummaryEntry[] = [];

    // Aggregate today's readings from current.json
    const todayReadings = groupReadingsByDate(currentData, serviceName).get(today);
    if (todayReadings && todayReadings.length > 0) {
      entries.push(aggregateDayReadings(today, todayReadings));
    }

    // Add historical entries (excluding today if present)
    for (const entry of historicalEntries) {
      if (entry.date !== today) {
        entries.push(entry);
      }
    }

    return entries.slice(0, days);
  }, [summaryData, currentData, serviceName, days]);

  return { data: mergedData, loading, error, lastUpdated };
}

Key behaviors:

  1. Today always comes from current.json: Real-time readings, not stale aggregates
  2. History comes from daily-summary.json: Pre-aggregated, fast to load
  3. Graceful fallback: If summary fails, show 14 days from current.json only
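Stripped of React, the merge boils down to a pure function. This is a simplified, hypothetical version of the hook's logic (the `Entry` shape and `mergeEntries` name are illustrative, not the actual API):

```typescript
interface Entry {
  date: string;      // "YYYY-MM-DD" (UTC)
  uptimePct: number; // 0.0 to 1.0
}

// Today's live entry wins over any stale summary entry for the same date.
function mergeEntries(
  todayEntry: Entry | null,   // aggregated live from current.json
  historicalEntries: Entry[], // from daily-summary.json ([] if the fetch failed)
  today: string,
  days: number,
): Entry[] {
  const entries: Entry[] = [];
  if (todayEntry) entries.push(todayEntry);
  for (const e of historicalEntries) {
    if (e.date !== today) entries.push(e); // drop the stale "today" row
  }
  return entries.slice(0, days);
}
```

When the summary fetch fails, `historicalEntries` is empty and the result naturally degrades to whatever window the live data covers, which is behavior 3 above.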

Monitor Script Integration

The stentorosaur-monitor CLI now generates daily-summary.json after each health check.

Aggregation Logic

The core aggregation converts raw readings into a daily summary:

function aggregateDayReadings(date, readings) {
  const checksTotal = readings.length;
  const checksPassed = readings.filter(
    r => r.state === 'up' || r.state === 'maintenance'
  ).length;
  const uptimePct = checksTotal > 0 ? checksPassed / checksTotal : 0;

  // Only include latency from successful checks
  const latencies = readings
    .filter(r => r.state === 'up')
    .map(r => r.lat)
    .sort((a, b) => a - b);

  const avgLatencyMs = latencies.length > 0
    ? Math.round(latencies.reduce((sum, lat) => sum + lat, 0) / latencies.length)
    : null;

  // P95: 95th percentile (excludes top 5% outliers)
  const p95LatencyMs = latencies.length > 0
    ? latencies[Math.ceil(latencies.length * 0.95) - 1]
    : null;

  // Count incidents: up→down transitions
  let incidentCount = 0;
  for (let i = 1; i < readings.length; i++) {
    if (readings[i - 1].state === 'up' && readings[i].state === 'down') {
      incidentCount++;
    }
  }

  return { date, uptimePct, avgLatencyMs, p95LatencyMs, checksTotal, checksPassed, incidentCount };
}
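The incident-counting rule is worth a worked example. The snippet below isolates it (the `countIncidents` helper and the sample state sequences are hypothetical) to show why a flapping day scores higher than one long outage:

```typescript
// Only up→down transitions count as incidents, per the rule above.
type State = 'up' | 'down' | 'maintenance';

function countIncidents(states: State[]): number {
  let incidents = 0;
  for (let i = 1; i < states.length; i++) {
    if (states[i - 1] === 'up' && states[i] === 'down') incidents++;
  }
  return incidents;
}

// Flapping: three separate up→down transitions in one day
const flappingDay: State[] = ['up', 'down', 'up', 'down', 'down', 'up', 'down'];
console.log(countIncidents(flappingDay)); // 3

// One long outage: a single transition, however many checks it spans
const longOutage: State[] = ['up', 'down', 'down', 'down', 'down', 'up', 'up'];
console.log(countIncidents(longOutage)); // 1
```

Both days can land on similar uptime percentages; the incident count is what tells them apart.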

Summary Generation

The main generation function reads archives and aggregates by service/date:

function generateDailySummary(archivesDir, outputDir, windowDays = 90) {
  const cutoffDate = new Date();
  cutoffDate.setDate(cutoffDate.getDate() - windowDays);

  // Collect readings from archives
  const serviceReadings = new Map();

  for (const archiveFile of getArchiveFiles(archivesDir, cutoffDate)) {
    const readings = parseJsonlFile(archiveFile);
    for (const reading of readings) {
      // Group by service and date (UTC)
      const dateKey = new Date(reading.t).toISOString().split('T')[0];
      const key = `${reading.svc}:${dateKey}`;
      if (!serviceReadings.has(key)) {
        serviceReadings.set(key, []);
      }
      serviceReadings.get(key).push(reading);
    }
  }

  // Aggregate to daily summaries
  const services = {};
  for (const [key, readings] of serviceReadings) {
    const [svc, date] = key.split(':');
    if (!services[svc]) services[svc] = [];
    // Sort readings by timestamp before aggregating (for incident counting);
    // parse through Date so both epoch-ms and ISO-string timestamps sort correctly
    readings.sort((a, b) => new Date(a.t).getTime() - new Date(b.t).getTime());
    services[svc].push(aggregateDayReadings(date, readings));
  }

  // Sort each service's entries by date descending
  for (const svc of Object.keys(services)) {
    services[svc].sort((a, b) => b.date.localeCompare(a.date));
  }

  const summary = {
    version: 1,
    lastUpdated: new Date().toISOString(),
    windowDays,
    services,
  };

  // Atomic write: temp file then rename
  const tmpPath = path.join(outputDir, 'daily-summary.tmp');
  const finalPath = path.join(outputDir, 'daily-summary.json');
  fs.writeFileSync(tmpPath, JSON.stringify(summary, null, 2));
  fs.renameSync(tmpPath, finalPath);
}

The atomic write pattern (write to .tmp, then rename) prevents serving partial files if the process crashes mid-write.

Bootstrap Script

Existing Stentorosaur users have months of archive data but no daily-summary.json. The bootstrap script backfills:

npx stentorosaur-bootstrap-summary \
  --archives-dir status-data/archives \
  --output-dir status-data \
  --window 90

Run once during upgrade to v0.17.0. After that, the monitor workflow maintains it automatically.

Performance Results

Comparing the naive approach (loading raw archives) vs. the summary approach:

Metric | Naive (raw archives) | Optimized (summary)
------ | -------------------- | -------------------
Data loaded | 2.5MB (90 days raw) | 15KB (summary)
Parse time | 3-5s (mobile) | under 50ms
Heatmap coverage | 90 days | 90 days
First contentful paint | ~4s | ~1s

The "naive" column is what we would have shipped if we loaded raw JSONL archives client-side. Instead, we pre-aggregate server-side and ship a tiny summary file.

TypeScript Types

For consumers building on this data:

export interface DailySummaryEntry {
  date: string;          // "2025-12-31"
  uptimePct: number;     // 0.0 to 1.0
  avgLatencyMs: number | null;
  p95LatencyMs: number | null;
  checksTotal: number;
  checksPassed: number;
  incidentCount: number;
}

export interface DailySummaryFile {
  version: number;
  lastUpdated: string;   // ISO timestamp
  windowDays: number;
  services: Record<string, DailySummaryEntry[]>;
}
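As a hypothetical consumer sketch, here is one way to compute a service's overall uptime across the window. The `overallUptime` helper is illustrative, not part of the package; the entry interface is repeated so the snippet compiles standalone:

```typescript
// Repeated from above so this snippet is self-contained.
interface DailySummaryEntry {
  date: string;
  uptimePct: number;
  avgLatencyMs: number | null;
  p95LatencyMs: number | null;
  checksTotal: number;
  checksPassed: number;
  incidentCount: number;
}

// Weight each day by its check count rather than averaging uptimePct,
// so partial days (e.g. today, mid-UTC-day) don't skew the window.
function overallUptime(
  entries: Pick<DailySummaryEntry, 'checksTotal' | 'checksPassed'>[],
): number {
  const total = entries.reduce((sum, e) => sum + e.checksTotal, 0);
  const passed = entries.reduce((sum, e) => sum + e.checksPassed, 0);
  return total > 0 ? passed / total : 0;
}
```

Averaging the daily `uptimePct` values directly would give a half-finished day the same weight as a full one; summing check counts avoids that.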

Configuration

Enable 90-day heatmaps by providing a data source URL:

// docusaurus.config.js
{
  plugins: [
    ['@amiable-dev/docusaurus-plugin-stentorosaur', {
      dataSource: {
        strategy: 'github',
        owner: 'your-org',
        repo: 'your-repo',
        branch: 'status-data',
      },
      // StatusItem will automatically use daily-summary.json
    }],
  ],
}

The StatusItem component now accepts dataBaseUrl and heatmapDays props:

<StatusItem
  item={systemStatus}
  dataBaseUrl="/status-data"  // Enables 90-day heatmap
  heatmapDays={90}            // Override default (14)
/>

Caching Considerations

A 15KB static JSON file is small, but caching affects freshness:

  • CDN caching: If your CDN caches daily-summary.json for 24 hours, users see stale historical data. Set Cache-Control: max-age=300 (5 minutes) or use cache invalidation on update.
  • Browser caching: The hybrid pattern helps here—even if the summary is cached, current.json provides fresh "today" data.
  • ETag support: Consider adding ETag headers so clients can do conditional fetches without downloading unchanged content.

For GitHub Pages / raw.githubusercontent.com, caching is automatic with ~5 minute TTL, which works well for status pages.
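A conditional fetch along those lines could look like the sketch below. This is not Stentorosaur code: the `SummaryCache` class is hypothetical, it assumes a server that emits ETag headers, and the fetch function is injected so the behavior can be exercised without a network:

```typescript
// Hypothetical helper: remember the last ETag, send If-None-Match,
// and treat a 304 as "the cached body is still current".
interface MiniResponse {
  status: number;
  headers: { get(name: string): string | null };
  text(): Promise<string>;
}
type FetchLike = (url: string, init: { headers: Record<string, string> }) => Promise<MiniResponse>;

class SummaryCache {
  private etag: string | null = null;
  private body: string | null = null;

  async load(url: string, doFetch: FetchLike): Promise<string | null> {
    const headers: Record<string, string> = {};
    if (this.etag) headers['If-None-Match'] = this.etag;
    const res = await doFetch(url, { headers });
    if (res.status === 304) return this.body; // unchanged: reuse cached copy
    if (res.status !== 200) return null;      // caller falls back to current.json
    this.etag = res.headers.get('ETag');
    this.body = await res.text();
    return this.body;
  }
}
```

The first request downloads the summary and records its ETag; subsequent requests cost only a header round-trip until the file actually changes.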

Lessons Learned

  1. Aggregate at write time, not read time: Moving computation from the browser to the monitoring workflow is a classic producer-side aggregation pattern. The monitoring job does the work once; every client benefits.

  2. Hybrid reads solve staleness: Never trust a single data source for time-sensitive data. Merge live and historical sources to get both freshness and efficiency.

  3. Percentiles reveal what averages hide: A day with 10% slow requests looks fine at avg=150ms but terrible at P95=2000ms. Track both, display the one that matters for your users.

  4. Schema versioning from day one: Adding version: 1 costs nothing and saves future migration pain.

  5. Atomic writes prevent corruption: Write to a temp file, then rename. This prevents clients from fetching half-written JSON during the monitoring run.

Upgrade Path

Stentorosaur v0.17.0 includes daily summary aggregation. To upgrade:

  1. Update the plugin: npm install @amiable-dev/docusaurus-plugin-stentorosaur@latest
  2. Run the bootstrap script once (if you have existing data)
  3. Heatmaps automatically expand to 90 days

The monitoring workflow handles summary generation automatically. No workflow changes required.

Summary

The 84% empty heatmap was a data architecture problem, not a UI bug. By pre-aggregating daily summaries and implementing a hybrid read pattern, we achieved:

  • Full 90-day heatmap coverage
  • Sub-50ms data loading
  • Real-time accuracy for today's data
  • Graceful fallbacks when data is missing

The pattern applies anywhere you need historical aggregates with live updates: analytics dashboards, SLA reports, or any time-series visualization.