ADR-003: Hybrid Caching Strategy for Performance
Status: APPROVED as Critical Priority
Date: 2025-08-25
Author: Architecture Review Team
Context
Badge serving must achieve sub-200ms p95 latency; this is non-negotiable for developer trust. With semantic analysis taking 5-30 seconds per repository, aggressive caching is essential.
The current plan suggests Redis caching, but the complexity of cache invalidation with semantic updates needs deeper consideration.
Decision
Multi-tier caching with intelligent invalidation:
Tier 1 - CDN Edge Cache (Cloudflare):
Location: Global edge locations
TTL: 1 hour for dynamic badges, 24 hours for static badges
Invalidation: On repository update webhooks
Coverage: 95% of requests
Tier 2 - Application Cache (Redis):
Location: Same region as app server
TTL: 6 hours for analysis, 1 hour for computed badges
Invalidation: Semantic analysis updates, user customizations
Coverage: Fallback for cache misses
Tier 3 - Database Cache (PostgreSQL):
Location: Persistent storage
TTL: 7 days for analysis, indefinite for repositories
Invalidation: Manual refresh, major algorithm updates
Coverage: Source of truth
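The tier TTLs above can be collected into a single configuration sketch. The constant name and shape below are illustrative, not from an existing codebase:

```typescript
// Cache TTLs per tier, in seconds. Values mirror the tiers above.
const CACHE_TTL = {
  cdn: {
    dynamicBadge: 60 * 60,       // 1 hour
    staticBadge: 24 * 60 * 60,   // 24 hours
  },
  redis: {
    analysis: 6 * 60 * 60,       // 6 hours
    computedBadge: 60 * 60,      // 1 hour
  },
  postgres: {
    analysis: 7 * 24 * 60 * 60,  // 7 days
    repository: Infinity,        // retained until manual refresh
  },
} as const
```

Keeping all TTLs in one place makes it easy to verify that each tier expires no later than the tier beneath it.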
Consequences
Positive:
- Can achieve <100ms p95 for cached badges (90%+ hit rate expected)
- Graceful degradation during outages
- Cost optimization through reduced API calls
- Global performance through CDN
Negative:
- Complex invalidation logic increases bug risk
- Potential stale data issues during rapid development
- Increased infrastructure complexity and cost
- Cache consistency challenges across tiers
Alternatives Considered
- Single-Tier Redis Cache
  - Pros: Simple implementation
  - Cons: Single point of failure; higher latency for global users
- CDN-Only Caching
  - Pros: Maximum performance, global distribution
  - Cons: Limited control over invalidation; expensive for dynamic content
- Database Query Optimization Only
  - Pros: No cache complexity
  - Cons: Cannot meet the <200ms requirement for on-demand badge generation
- Edge Computing (Cloudflare Workers)
  - Pros: Ultimate performance, serverless scaling
  - Cons: Complex deployment, vendor lock-in
Risk Assessment
Critical Risks:
- Cache Stampede: Popular repositories triggering simultaneous cache rebuilds
- Stale Data: Users seeing outdated badges after repository updates
- Memory Pressure: Redis instance sizing for growing repository count
- Invalidation Bugs: Incorrect cache invalidation leading to inconsistent state
Mitigation Strategies:
- Cache Stampede Prevention:
```typescript
class CacheManager {
  private store = new Map<string, unknown>()
  // Keys with a rebuild currently in flight
  private buildingCache = new Set<string>()

  async getOrBuild<T>(key: string, builder: () => Promise<T>): Promise<T> {
    if (this.buildingCache.has(key)) {
      // A rebuild is already running; serve the last known value rather
      // than start a duplicate expensive build
      return this.getStaleData(key) as T
    }
    this.buildingCache.add(key)
    try {
      const result = await builder()
      await this.setCache(key, result)
      return result
    } finally {
      this.buildingCache.delete(key)
    }
  }

  private getStaleData(key: string): unknown {
    return this.store.get(key)
  }

  private async setCache(key: string, value: unknown): Promise<void> {
    this.store.set(key, value)
  }
}
```
- Smart Invalidation:
```typescript
// Webhook handler for repository updates.
// Note: GitHub identifies push events via the X-GitHub-Event request
// header, not an `action` field on the payload.
export async function handleGitHubWebhook(event: string, payload: GitHubWebhook) {
  if (event === 'push') {
    // Intelligent invalidation based on which files changed
    const hasSignificantChanges = payload.commits.some(commit =>
      [...commit.added, ...commit.modified, ...commit.removed].some(file =>
        file.endsWith('package.json') ||
        file.endsWith('requirements.txt') ||
        file === 'README.md'
      )
    )
    if (hasSignificantChanges) {
      await invalidateCaches(payload.repository.full_name)
      await queueReanalysis(payload.repository.full_name)
    }
  }
}
```
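One way the `invalidateCaches` step could purge the CDN tier is Cloudflare's cache-purge endpoint. The sketch below is an assumption-laden illustration: the zone ID, API token, and badge URL layout (`badges.example.com/...`) are all placeholders, not a committed design.

```typescript
// Hypothetical helper: build a Cloudflare purge request for a repository's
// badge URLs. The badge URL scheme here is an assumption for illustration.
function buildPurgeRequest(zoneId: string, repoFullName: string) {
  return {
    url: `https://api.cloudflare.com/client/v4/zones/${zoneId}/purge_cache`,
    body: {
      files: [
        `https://badges.example.com/${repoFullName}/badge.svg`,
        `https://badges.example.com/${repoFullName}/badge.json`,
      ],
    },
  }
}

async function purgeCdnCache(zoneId: string, token: string, repoFullName: string) {
  const { url, body } = buildPurgeRequest(zoneId, repoFullName)
  await fetch(url, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${token}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(body),
  })
}
```

Purging by exact URL keeps invalidation surgical: only the updated repository's badges are evicted, while the rest of the edge cache stays warm.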
Migration Strategy
Phase 1: Application Caching (Week 1)
- Implement Redis caching for analysis results
- Basic TTL-based invalidation
- Performance monitoring setup
Phase 2: CDN Integration (Week 2)
- Configure Cloudflare caching rules
- Implement webhook-based invalidation
- Global performance testing
Phase 3: Optimization (Week 3)
- Add cache warming for popular repositories
- Implement stale-while-revalidate pattern
- Advanced invalidation logic based on file changes
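The stale-while-revalidate pattern planned for Phase 3 can be sketched as follows. This in-memory version is a minimal illustration under assumed names; a production version would back the store with Redis and share the revalidation lock across instances.

```typescript
interface CacheEntry<T> {
  value: T
  freshUntil: number // epoch ms; after this the entry is stale but still servable
}

class SwrCache<T> {
  private store = new Map<string, CacheEntry<T>>()
  private revalidating = new Set<string>()

  constructor(private freshMs: number) {}

  async get(key: string, rebuild: () => Promise<T>): Promise<T> {
    const entry = this.store.get(key)
    const now = Date.now()
    if (entry && now < entry.freshUntil) return entry.value // fresh hit
    if (entry) {
      // Stale hit: serve immediately, refresh in the background exactly once
      if (!this.revalidating.has(key)) {
        this.revalidating.add(key)
        rebuild()
          .then(value =>
            this.store.set(key, { value, freshUntil: Date.now() + this.freshMs })
          )
          .finally(() => this.revalidating.delete(key))
      }
      return entry.value
    }
    // Cold miss: the caller must wait for the build
    const value = await rebuild()
    this.store.set(key, { value, freshUntil: now + this.freshMs })
    return value
  }
}
```

This gives stale responses within the latency budget while the expensive rebuild happens off the request path, which pairs naturally with the stampede prevention above.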
Conclusion
The multi-tier caching strategy is essential for meeting the <200ms performance requirement. While complex, the approach provides multiple fallback layers and enables global performance at scale. The implementation should prioritize cache stampede prevention and intelligent invalidation from day one.