ADR-003: Hybrid Caching Strategy for Performance

Status: APPROVED as Critical Priority
Date: 2025-08-25
Author: Architecture Review Team

Context

Badge serving must achieve <200ms p95 latency; this is non-negotiable for developer trust. With semantic analysis taking 5-30 seconds, aggressive caching is essential.

The current plan suggests Redis caching, but the complexity of cache invalidation with semantic updates needs deeper consideration.

Decision

Multi-tier caching with intelligent invalidation:

Tier 1 - CDN Edge Cache (Cloudflare):
Location: Global edge locations
TTL: 1 hour for badges, 24 hours for static badges
Invalidation: On repository update webhooks
Coverage: 95% of requests

Tier 2 - Application Cache (Redis):
Location: Same region as app server
TTL: 6 hours for analysis, 1 hour for computed badges
Invalidation: Semantic analysis updates, user customizations
Coverage: Fallback for cache misses

Tier 3 - Database Cache (PostgreSQL):
Location: Persistent storage
TTL: 7 days for analysis, indefinite for repositories
Invalidation: Manual refresh, major algorithm updates
Coverage: Source of truth
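The three tiers above share a single read path: check the fastest tier first, fall through on a miss, and backfill the faster tiers on a hit. A minimal sketch of that fallthrough, with in-memory Maps standing in for Cloudflare, Redis, and PostgreSQL (the `Tier` interface, `memoryTier`, and `getBadge` names are illustrative, not part of the actual codebase):

```typescript
// Each tier is modeled as a simple async key-value store.
interface Tier {
  get(key: string): Promise<string | undefined>
  set(key: string, value: string): Promise<void>
}

function memoryTier(): Tier {
  const store = new Map<string, string>()
  return {
    async get(key) { return store.get(key) },
    async set(key, value) { store.set(key, value) },
  }
}

// Stand-ins for the CDN edge, Redis, and the PostgreSQL source of truth.
const tiers: Tier[] = [memoryTier(), memoryTier(), memoryTier()]

// Read path: check tiers fastest-first; on a hit, backfill the faster
// tiers that missed so subsequent reads stop earlier.
async function getBadge(
  key: string,
  render: () => Promise<string>,
): Promise<string> {
  for (let i = 0; i < tiers.length; i++) {
    const hit = await tiers[i].get(key)
    if (hit !== undefined) {
      for (let j = 0; j < i; j++) await tiers[j].set(key, hit)
      return hit
    }
  }
  // Full miss: run the expensive render and populate every tier.
  const value = await render()
  for (const tier of tiers) await tier.set(key, value)
  return value
}
```

The backfill step is what keeps the 95% Tier 1 coverage target realistic: a badge only falls through to slower tiers once per expiry.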

Consequences

Positive:

  • Can achieve <100ms p95 for cached badges (90%+ hit rate expected)
  • Graceful degradation during outages
  • Cost optimization through reduced API calls
  • Global performance through CDN

Negative:

  • Complex invalidation logic increases bug risk
  • Potential stale data issues during rapid development
  • Increased infrastructure complexity and cost
  • Cache consistency challenges across tiers

Alternatives Considered

  1. Single-Tier Redis Cache

    • Pros: Simple implementation
    • Cons: Single point of failure, higher latency for global users
  2. CDN-Only Caching

    • Pros: Maximum performance, global distribution
    • Cons: Limited control over invalidation, expensive for dynamic content
  3. Database Query Optimization Only

    • Pros: No cache complexity
    • Cons: Cannot achieve <200ms requirement for badge generation
  4. Edge Computing (Cloudflare Workers)

    • Pros: Ultimate performance, serverless scaling
    • Cons: Complex deployment, vendor lock-in

Risk Assessment

Critical Risks:

  • Cache Stampede: Popular repositories triggering simultaneous cache rebuilds
  • Stale Data: Users seeing outdated badges after repository updates
  • Memory Pressure: Redis instance sizing for growing repository count
  • Invalidation Bugs: Incorrect cache invalidation leading to inconsistent state

Mitigation Strategies:

  1. Cache Stampede Prevention:
class CacheManager {
  // Keys whose cache entry is currently being rebuilt
  private buildingCache = new Set<string>()

  async getOrBuild(key: string, builder: () => Promise<any>): Promise<any> {
    if (this.buildingCache.has(key)) {
      // A rebuild is already in flight: serve the stale entry
      // instead of launching a duplicate expensive build
      return this.getStaleData(key)
    }

    this.buildingCache.add(key)
    try {
      const result = await builder()
      await this.setCache(key, result)
      return result
    } finally {
      // Always clear the flag, even if the build throws
      this.buildingCache.delete(key)
    }
  }

  // getStaleData and setCache omitted for brevity
}
  2. Smart Invalidation:
// Webhook handler for repository updates.
// GitHub identifies the event type via the X-GitHub-Event header rather
// than a payload field, so the caller passes it in alongside the body.
export async function handleGitHubWebhook(event: string, payload: GitHubWebhook) {
  if (event === 'push') {
    // Only invalidate when files that affect the analysis changed
    const hasSignificantChanges = payload.commits.some(commit =>
      commit.modified.some(file =>
        file.endsWith('package.json') ||
        file.endsWith('requirements.txt') ||
        file === 'README.md'
      )
    )

    if (hasSignificantChanges) {
      await invalidateCaches(payload.repository.full_name)
      await queueReanalysis(payload.repository.full_name)
    }
  }
}

Migration Strategy

Phase 1: Application Caching (Week 1)

  • Implement Redis caching for analysis results
  • Basic TTL-based invalidation
  • Performance monitoring setup
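Phase 1's TTL behavior can be prototyped before Redis is wired in. A minimal sketch with an in-memory store (the Map stands in for Redis; in production this would map onto the client's `SET key value EX seconds`, and the `TtlCache` name is illustrative):

```typescript
// TTL cache: each entry records its expiry time; reads past the
// deadline behave like a miss, mirroring Redis SETEX semantics.
class TtlCache<T> {
  private store = new Map<string, { value: T; expiresAt: number }>()

  // Clock is injectable so TTL behavior is testable without real waits
  constructor(private now: () => number = Date.now) {}

  set(key: string, value: T, ttlSeconds: number): void {
    this.store.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 })
  }

  get(key: string): T | undefined {
    const entry = this.store.get(key)
    if (!entry) return undefined
    if (this.now() > entry.expiresAt) {
      this.store.delete(key) // lazy expiry on read
      return undefined
    }
    return entry.value
  }
}
```

Injecting `now` keeps the Tier 2 TTL values (6 hours for analysis, 1 hour for computed badges) verifiable in unit tests.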

Phase 2: CDN Integration (Week 2)

  • Configure Cloudflare caching rules
  • Implement webhook-based invalidation
  • Global performance testing
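One way to drive the Phase 2 Cloudflare rules is to have the origin emit Cache-Control headers the edge honors, rather than maintaining page rules by hand. A hedged sketch (the split between edge TTL via `s-maxage` and a short browser `max-age` is an assumed policy, and `badgeCacheHeaders` is a hypothetical helper):

```typescript
// Build cache headers for a badge response.
function badgeCacheHeaders(isStatic: boolean): Record<string, string> {
  // Tier 1 TTLs: 24h for static badges, 1h for regular badges
  const edgeTtl = isStatic ? 24 * 3600 : 3600
  return {
    // s-maxage governs shared caches (the CDN edge); max-age keeps the
    // browser window short so webhook-driven purges reach users quickly
    'Cache-Control': `public, max-age=60, s-maxage=${edgeTtl}`,
    // Keep compressed and uncompressed variants as separate cache entries
    'Vary': 'Accept-Encoding',
  }
}
```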

Phase 3: Optimization (Week 3)

  • Add cache warming for popular repositories
  • Implement stale-while-revalidate pattern
  • Advanced invalidation logic based on file changes
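The stale-while-revalidate pattern from Phase 3 can be sketched as: serve whatever is cached immediately, and if it is past its freshness window, kick off a background rebuild rather than making the caller wait. A minimal sketch (the `SwrCache` class and its field names are illustrative; the clock is injectable for testing):

```typescript
interface Entry<T> { value: T; freshUntil: number }

class SwrCache<T> {
  private store = new Map<string, Entry<T>>()
  // Keys with a background refresh already in flight
  private refreshing = new Set<string>()

  constructor(private now: () => number = Date.now) {}

  async get(
    key: string,
    ttlMs: number,
    rebuild: () => Promise<T>,
  ): Promise<T> {
    const entry = this.store.get(key)
    if (!entry) {
      // Cold miss: the first caller has to wait for a full build
      const value = await rebuild()
      this.store.set(key, { value, freshUntil: this.now() + ttlMs })
      return value
    }
    if (this.now() > entry.freshUntil && !this.refreshing.has(key)) {
      // Stale: refresh in the background, but serve the old value now
      this.refreshing.add(key)
      rebuild()
        .then(value => {
          this.store.set(key, { value, freshUntil: this.now() + ttlMs })
        })
        .finally(() => this.refreshing.delete(key))
    }
    return entry.value
  }
}
```

This is also a second line of defense against stampedes: stale entries absorb traffic while exactly one refresh runs per key.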

Conclusion

The multi-tier caching strategy is essential for meeting the <200ms performance requirement. While complex, the approach provides multiple fallback layers and enables global performance at scale. The implementation should prioritize cache stampede prevention and intelligent invalidation from day one.