ADR-014: Entity Counts Batching

Status

Implemented

Date

2025-01-16 (Retrospective)

Decision Makers

  • Frontend Team - Performance requirements
  • Backend Team - API optimization

Layer

Caching

Related ADRs

  • ADR-011: Redis Single Instance Strategy
  • ADR-007: React Query for Server State
Supersedes

  • Individual entity count endpoints

Depends On

  • ADR-011: Redis Single Instance Strategy

Context

The sidebar displays entity counts for all 14 entity types, causing:

  1. N+1 Problem: 14 separate API calls on page load
  2. Database Load: 14 COUNT(*) queries per page view
  3. Latency: Sequential requests add up
  4. Rate Limiting: High traffic may trigger rate limits
  5. Cache Inefficiency: Each count is cached separately

Performance before optimization:

  • 14 separate API calls per page load
  • ~300ms total response time
  • Database hit on every uncached request

Decision

We implement batched entity counts with aggressive caching:

Key Design Decisions

  1. Single Endpoint: /api/v1/entity-counts returns all counts
  2. Aggressive Caching: 60-second TTL in Redis
  3. Background Refresh: Cache warmed proactively
  4. React Query Integration: Single query key for all counts
  5. Fallback: Individual queries if batch fails

API Response

GET /api/v1/entity-counts

{
  "counts": {
    "requirements": 156,
    "capabilities": 42,
    "processes": 23,
    "applications": 87,
    "slis": 34,
    "slos": 28,
    ...
  },
  "cached_at": "2025-01-16T10:30:00Z",
  "cache_ttl": 60
}
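For reference, the response shape above can be modeled as a typed structure. This is a sketch; the field names follow the JSON example, while the class name is hypothetical:

```python
from typing import TypedDict


class EntityCountsResponse(TypedDict):
    counts: dict[str, int]  # entity type -> current count
    cached_at: str          # ISO-8601 timestamp when the batch was computed
    cache_ttl: int          # server-side TTL in seconds
```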

Backend Implementation

@router.get("/entity-counts")
async def get_entity_counts(
    db: Session = Depends(get_db),
    cache: CacheManager = Depends(get_cache),
):
    # Check cache first
    cached = await cache.get("entity:counts:all")
    if cached:
        return cached

    # Batch query all counts
    counts = {}
    for entity_type, model in ENTITY_MODELS.items():
        counts[entity_type] = db.query(model).filter(
            model.is_deleted == False  # SQLAlchemy expression, not a truthiness check
        ).count()

    result = {"counts": counts, "cached_at": datetime.utcnow()}
    await cache.set("entity:counts:all", result, ttl=60)
    return result
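Design decision 3 ("Background Refresh") is not shown in the endpoint above. A minimal sketch of a proactive warmer might look like the following, where `cache` and `compute_counts` stand in for the real Redis client and the batch query, and the 45-second interval (chosen to undercut the 60-second TTL) is an assumed value:

```python
import asyncio
import time


async def refresh_entity_counts(cache, compute_counts, ttl=60):
    """One warm cycle: recompute all counts and overwrite the cache entry."""
    counts = await compute_counts()
    await cache.set(
        "entity:counts:all",
        {"counts": counts, "cached_at": time.time()},
        ttl=ttl,
    )
    return counts


async def warm_loop(cache, compute_counts, ttl=60, interval=45):
    """Refresh slightly faster than the TTL so readers rarely see a miss."""
    while True:
        await refresh_entity_counts(cache, compute_counts, ttl=ttl)
        await asyncio.sleep(interval)
```

Running the warmer as a background task keeps the endpoint's cache-miss path as a fallback rather than the common case.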

Frontend Integration

const { data: counts } = useQuery({
  queryKey: ['entity-counts'],
  queryFn: () => api.get('/api/v1/entity-counts'),
  staleTime: 30_000,       // 30 seconds client-side
  refetchInterval: 60_000, // refresh every minute
});

Consequences

Positive

  • 93% Request Reduction: 14 → 1 request per page load
  • 100x Performance: 300ms → 3ms response time
  • Database Relief: Single batch query vs 14 individual
  • Cache Efficiency: One cache entry vs 14
  • Simpler Frontend: Single query subscription

Negative

  • All-or-Nothing: Can't refresh single count
  • Stale Data: Up to 60 seconds behind reality
  • Memory: Larger cache entry
  • Coupling: All counts tied together

Neutral

  • Cache Invalidation: Count changes are not reflected until the TTL expires
  • Error Handling: Batch failure more impactful

Alternatives Considered

1. GraphQL Batching

  • Approach: GraphQL query for multiple counts
  • Rejected: GraphQL disabled due to conflicts

2. SSE Count Updates

  • Approach: Server-sent events for real-time counts
  • Rejected: Complexity, not needed for counts

3. Per-Entity Caching

  • Approach: Cache each count with short TTL
  • Rejected: Still 14 requests, less efficient

Implementation Status

  • Core implementation complete
  • Tests written and passing
  • Documentation updated
  • Migration/upgrade path defined
  • Monitoring/observability in place

Implementation Details

  • Endpoint: backend/api/v1/entity_counts.py
  • Cache Key: entity:counts:all
  • Frontend Hook: frontend/src/hooks/useEntityCounts.ts
  • Sidebar: frontend/src/components/layout/Sidebar.tsx

Compliance/Validation

  • Automated checks: Response time monitoring
  • Manual review: Count accuracy verified
  • Metrics: Cache hit rate, response time via Prometheus

LLM Council Review

Review Date: 2025-01-16
Confidence Level: High (100%)
Verdict: APPROVED WITH MODIFICATIONS

Quality Metrics

  • Consensus Strength Score (CSS): 0.90
  • Deliberation Depth Index (DDI): 0.88

Council Feedback Summary

Batching is approved as correct, but the 60-second cache TTL is dangerous for SRE workflows. A 1-2 minute lag allows incidents to escalate while dashboards show "All Green."

Key Concerns Identified:

  1. Compound Staleness: 60s Redis TTL + 60s React Query refetch = worst-case 120 seconds lag
  2. One Size Doesn't Fit All: Treating Active Alerts the same as Total Teams is a fundamental error
  3. Coupling Risk: If one expensive query hangs, it blocks all critical operational data
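The compound-staleness bound in concern 1 is easy to verify: a client can fetch a cache entry just before the Redis TTL expires, then hold it for a full refetch period before asking again.

```python
redis_ttl = 60           # server-side cache TTL (seconds)
refetch_interval = 60    # React Query refetchInterval (seconds)

# Worst case: the entry is ~60s old at fetch time, then held for
# another 60s before the next refetch.
worst_case_lag = redis_ttl + refetch_interval  # 120 seconds
```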

Required Modifications:

  1. Tiered Caching ("Volatility Matrix"):
    • Critical (Alerts, Incidents): TTL 5-10 seconds
    • Operational (Deployments, On-call): TTL 30 seconds
    • Reference (Inventory, Costs): TTL 60s-5min
  2. Partial Success Logic: Return 200 OK with null/error for failed entities, don't fail entire request
  3. Fix Client-Side: Reduce staleTime for critical data to 0-5s
  4. UI Transparency: Display "As of" timestamps so SREs know data age
  5. Event-Driven Invalidation: Use event hooks to immediately invalidate cache on new Incidents
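Modifications 1 and 2 can be sketched together. The tier names and TTLs below follow the volatility matrix above (using the lower bound of each band), while `gather_counts` and the per-entity callables are hypothetical stand-ins for the real query layer:

```python
# Tiered TTLs from the volatility matrix (seconds).
TTL_BY_TIER = {"critical": 5, "operational": 30, "reference": 60}
ENTITY_TIER = {
    "alerts": "critical",
    "incidents": "critical",
    "deployments": "operational",
    "teams": "reference",
}


def ttl_for(entity: str) -> int:
    """Entities not listed in the matrix default to the reference tier."""
    return TTL_BY_TIER[ENTITY_TIER.get(entity, "reference")]


def gather_counts(count_fns: dict) -> dict:
    """Partial success: a failing entity yields None plus an error note
    instead of failing the whole batch (the endpoint still returns 200)."""
    counts, errors = {}, {}
    for entity, fn in count_fns.items():
        try:
            counts[entity] = fn()
        except Exception as exc:
            counts[entity] = None
            errors[entity] = str(exc)
    return {"counts": counts, "errors": errors}
```

With per-tier TTLs, each tier can be cached under its own key so that one slow or failing query no longer blocks the critical operational counts.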

Modifications Applied

  1. Documented tiered caching strategy
  2. Added partial success response pattern
  3. Added "Force Refresh" button recommendation
  4. Documented data age indicator requirement

Council Ranking

  • gpt-5.2: Best Response (volatility matrix)
  • claude-opus-4.5: Strong (UI transparency)
  • gemini-3-pro: Good (event invalidation)

References

  • API Batching Patterns
  • Performance Benchmark: backend/docs/architecture/redis-implementation-complete.md

ADR-014 | Caching Layer | Implemented