ADR-014: Entity Counts Batching
Status
Implemented
Date
2025-01-16 (Retrospective)
Decision Makers
- Frontend Team - Performance requirements
- Backend Team - API optimization
Layer
Caching
Related ADRs
- ADR-011: Redis Single Instance Strategy
- ADR-007: React Query for Server State
Supersedes
- Individual entity count endpoints
Depends On
- ADR-011: Redis Single Instance Strategy
Context
The sidebar displays entity counts for all 17 entity types, causing:
- N+1 Problem: 17 separate API calls on page load
- Database Load: 17 COUNT(*) queries per page view
- Latency: Sequential requests add up
- Rate Limiting: Bursts of requests may trigger rate limits under high traffic
- Cache Inefficiency: Each count cached separately
Performance before optimization:
- 14 separate API calls per page load
- ~300ms total response time
- Database hit on every uncached request
Decision
We implement batched entity counts with aggressive caching:
Key Design Decisions
- Single Endpoint: /api/v1/entity-counts returns all counts
- Aggressive Caching: 60-second TTL in Redis
- Background Refresh: Cache warmed proactively
- React Query Integration: Single query key for all counts
- Fallback: Individual queries if batch fails
API Response
GET /api/v1/entity-counts
{
  "counts": {
    "requirements": 156,
    "capabilities": 42,
    "processes": 23,
    "applications": 87,
    "slis": 34,
    "slos": 28,
    ...
  },
  "cached_at": "2025-01-16T10:30:00Z",
  "cache_ttl": 60
}
Backend Implementation
@router.get("/entity-counts")
async def get_entity_counts(
    db: Session = Depends(get_db),
    cache: CacheManager = Depends(get_cache),
):
    # Serve from Redis while the batch entry is still warm
    cached = await cache.get("entity:counts:all")
    if cached:
        return cached

    # Batch query all counts in one pass over the registered models
    counts = {}
    for entity_type, model in ENTITY_MODELS.items():
        counts[entity_type] = (
            db.query(model).filter(model.is_deleted.is_(False)).count()
        )

    result = {
        "counts": counts,
        "cached_at": datetime.utcnow().isoformat() + "Z",
        "cache_ttl": 60,
    }
    await cache.set("entity:counts:all", result, ttl=60)
    return result
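The "Background Refresh" decision above implies a warming task that recomputes the batch entry before its TTL lapses. A minimal sketch, assuming a `compute_counts` coroutine and an async cache client with `set(key, value, ttl)` (both hypothetical stand-ins for the project's actual helpers):

```python
import asyncio

CACHE_KEY = "entity:counts:all"
WARM_INTERVAL = 45  # refresh a bit under the 60s TTL so readers rarely miss

async def warm_once(compute_counts, cache) -> dict:
    """Recompute the batch counts and refresh the cache entry."""
    result = await compute_counts()
    await cache.set(CACHE_KEY, result, ttl=60)
    return result

async def warm_forever(compute_counts, cache):
    """Background task (started at app startup) that keeps the entry warm."""
    while True:
        await warm_once(compute_counts, cache)
        await asyncio.sleep(WARM_INTERVAL)
```

With warming in place, most requests never see an expired key, so the 60-second TTL mainly bounds staleness rather than gating latency.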
Frontend Integration
const { data: counts } = useQuery({
  queryKey: ['entity-counts'],
  queryFn: () => api.get('/api/v1/entity-counts'),
  staleTime: 30_000,       // 30 seconds client-side
  refetchInterval: 60_000, // refresh every minute
});
Consequences
Positive
- 93% Request Reduction: 14 → 1 request per page load
- 100x Performance: 300ms → 3ms response time
- Database Relief: Single batch query vs 14 individual
- Cache Efficiency: One cache entry vs 14
- Simpler Frontend: Single query subscription
Negative
- All-or-Nothing: Can't refresh single count
- Stale Data: Up to 60 seconds behind reality
- Memory: Larger cache entry
- Coupling: All counts tied together
Neutral
- Cache Invalidation: Count changes don't immediately reflect
- Error Handling: Batch failure more impactful
Alternatives Considered
1. GraphQL Batching
- Approach: GraphQL query for multiple counts
- Rejected: GraphQL disabled due to conflicts
2. SSE Count Updates
- Approach: Server-sent events for real-time counts
- Rejected: Complexity, not needed for counts
3. Per-Entity Caching
- Approach: Cache each count with short TTL
- Rejected: Still 14 requests, less efficient
Implementation Status
- Core implementation complete
- Tests written and passing
- Documentation updated
- Migration/upgrade path defined
- Monitoring/observability in place
Implementation Details
- Endpoint: backend/api/v1/entity_counts.py
- Cache Key: entity:counts:all
- Frontend Hook: frontend/src/hooks/useEntityCounts.ts
- Sidebar: frontend/src/components/layout/Sidebar.tsx
Compliance/Validation
- Automated checks: Response time monitoring
- Manual review: Count accuracy verified
- Metrics: Cache hit rate, response time via Prometheus
LLM Council Review
Review Date: 2025-01-16
Confidence Level: High (100%)
Verdict: APPROVED WITH MODIFICATIONS
Quality Metrics
- Consensus Strength Score (CSS): 0.90
- Deliberation Depth Index (DDI): 0.88
Council Feedback Summary
Batching is approved as correct, but the 60-second cache TTL is dangerous for SRE workflows. A 1-2 minute lag allows incidents to escalate while dashboards show "All Green."
Key Concerns Identified:
- Compound Staleness: 60s Redis TTL + 60s React Query refetch = worst-case 120 seconds lag
- One Size Doesn't Fit All: Treating Active Alerts the same as Total Teams is a fundamental error
- Coupling Risk: If one expensive query hangs, it blocks all critical operational data
Required Modifications:
- Tiered Caching ("Volatility Matrix"):
- Critical (Alerts, Incidents): TTL 5-10 seconds
- Operational (Deployments, On-call): TTL 30 seconds
- Reference (Inventory, Costs): TTL 60s-5min
- Partial Success Logic: Return 200 OK with null/error for failed entities, don't fail entire request
- Fix Client-Side: Reduce staleTime for critical data to 0-5s
- UI Transparency: Display "As of" timestamps so SREs know data age
- Event-Driven Invalidation: Use event hooks to immediately invalidate cache on new Incidents
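The tiered "volatility matrix" above can be captured in a small lookup table. The tier assignments and names below are illustrative assumptions, not the shipped configuration:

```python
# TTLs per volatility tier, per the council's required modification
TTL_BY_TIER = {
    "critical": 5,       # alerts, incidents
    "operational": 30,   # deployments, on-call
    "reference": 60,     # inventory, costs (up to 300 for slow-moving data)
}

# Example tier assignments; unknown entity types fall back to "reference"
ENTITY_TIER = {
    "alerts": "critical",
    "incidents": "critical",
    "deployments": "operational",
    "requirements": "reference",
}

def ttl_for(entity_type: str) -> int:
    """Resolve the cache TTL for an entity type, defaulting to reference."""
    return TTL_BY_TIER[ENTITY_TIER.get(entity_type, "reference")]
```

Splitting the single batch key by tier (e.g. one Redis entry per tier) would let critical counts expire in seconds without recomputing the whole matrix.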
Modifications Applied
- Documented tiered caching strategy
- Added partial success response pattern
- Added "Force Refresh" button recommendation
- Documented data age indicator requirement
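The partial-success pattern noted above can be sketched as follows: each entity's count runs independently, and a failure yields a null count plus a per-entity error note instead of failing the whole response. Function and field names here are assumptions:

```python
def batch_counts(count_fns: dict) -> dict:
    """Run each count function; report per-entity failures, never raise.

    count_fns maps entity_type -> zero-arg callable returning an int.
    Failed entities appear as None in "counts" with a message in "errors",
    so the endpoint can still return 200 OK.
    """
    counts, errors = {}, {}
    for entity_type, fn in count_fns.items():
        try:
            counts[entity_type] = fn()
        except Exception as exc:
            counts[entity_type] = None
            errors[entity_type] = str(exc)
    return {"counts": counts, "errors": errors}
```

This also addresses the coupling risk: one hung or failing query degrades a single sidebar entry rather than the entire batch.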
Council Ranking
- gpt-5.2: Best Response (volatility matrix)
- claude-opus-4.5: Strong (UI transparency)
- gemini-3-pro: Good (event invalidation)
References
- API Batching Patterns
- Performance Benchmark:
backend/docs/architecture/redis-implementation-complete.md
ADR-014 | Caching Layer | Implemented