ADR-003: Hybrid Caching Strategy for Performance

Status: APPROVED as Critical Priority
Date: 2025-08-25
Author: Architecture Review Team

Context

Badge serving must achieve <200ms p95 latency; this is non-negotiable for developer trust. With semantic analysis taking 5-30 seconds, aggressive caching is essential.

The current plan suggests Redis caching, but the complexity of cache invalidation with semantic updates needs deeper consideration.

Decision

Multi-tier caching with intelligent invalidation:

Tier 1 - CDN Edge Cache (Cloudflare):
Location: Global edge locations
TTL: 1 hour for badges, 24 hours for static badges
Invalidation: On repository update webhooks
Coverage: 95% of requests

Tier 2 - Application Cache (Redis):
Location: Same region as app server
TTL: 6 hours for analysis, 1 hour for computed badges
Invalidation: Semantic analysis updates, user customizations
Coverage: Fallback for cache misses

Tier 3 - Database Cache (PostgreSQL):
Location: Persistent storage
TTL: 7 days for analysis, indefinite for repositories
Invalidation: Manual refresh, major algorithm updates
Coverage: Source of truth
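The three tiers above share a single read path: check the fastest tier first, fall through on a miss, and backfill the faster tiers on a hit. A minimal sketch of that fallthrough, with in-memory Maps standing in for Cloudflare, Redis, and PostgreSQL (the `Tier` interface, `memoryTier`, and `getBadge` names are illustrative, not part of the actual codebase):

```typescript
// Each tier is modeled as a simple async key-value store.
interface Tier {
  get(key: string): Promise<string | undefined>
  set(key: string, value: string): Promise<void>
}

function memoryTier(): Tier {
  const store = new Map<string, string>()
  return {
    async get(key) { return store.get(key) },
    async set(key, value) { store.set(key, value) },
  }
}

// Stand-ins for the CDN edge, Redis, and the PostgreSQL source of truth.
const tiers: Tier[] = [memoryTier(), memoryTier(), memoryTier()]

// Read path: check tiers fastest-first; on a hit, backfill the faster
// tiers that missed so subsequent reads stop earlier.
async function getBadge(
  key: string,
  render: () => Promise<string>,
): Promise<string> {
  for (let i = 0; i < tiers.length; i++) {
    const hit = await tiers[i].get(key)
    if (hit !== undefined) {
      for (let j = 0; j < i; j++) await tiers[j].set(key, hit)
      return hit
    }
  }
  // Full miss: run the expensive render and populate every tier.
  const value = await render()
  for (const tier of tiers) await tier.set(key, value)
  return value
}
```

The backfill step is what keeps the 95% Tier 1 coverage target realistic: a badge only falls through to slower tiers once per expiry.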

Consequences

Positive:

  • Can achieve <100ms p95 for cached badges (90%+ hit rate expected)
  • Graceful degradation during outages
  • Cost optimization through reduced API calls
  • Global performance through CDN

Negative:

  • Complex invalidation logic increases bug risk
  • Potential stale data issues during rapid development
  • Increased infrastructure complexity and cost
  • Cache consistency challenges across tiers

Alternatives Considered

  1. Single-Tier Redis Cache

    • Pros: Simple implementation
    • Cons: Single point of failure, higher latency for global users
  2. CDN-Only Caching

    • Pros: Maximum performance, global distribution
    • Cons: Limited control over invalidation, expensive for dynamic content
  3. Database Query Optimization Only

    • Pros: No cache complexity
    • Cons: Cannot achieve <200ms requirement for badge generation
  4. Edge Computing (Cloudflare Workers)

    • Pros: Ultimate performance, serverless scaling
    • Cons: Complex deployment, vendor lock-in

Risk Assessment

Critical Risks:

  • Cache Stampede: Popular repositories triggering simultaneous cache rebuilds
  • Stale Data: Users seeing outdated badges after repository updates
  • Memory Pressure: Redis instance sizing for growing repository count
  • Invalidation Bugs: Incorrect cache invalidation leading to inconsistent state

Mitigation Strategies:

  1. Cache Stampede Prevention:
class CacheManager {
  // Keys whose cache entry is currently being rebuilt
  private buildingCache = new Set<string>()

  async getOrBuild(key: string, builder: () => Promise<any>): Promise<any> {
    if (this.buildingCache.has(key)) {
      // A rebuild is already in flight: serve the stale entry
      // instead of launching a duplicate expensive build
      return this.getStaleData(key)
    }

    this.buildingCache.add(key)
    try {
      const result = await builder()
      await this.setCache(key, result)
      return result
    } finally {
      // Always clear the flag, even if the build throws
      this.buildingCache.delete(key)
    }
  }

  // getStaleData and setCache omitted for brevity
}
  2. Smart Invalidation:
// Webhook handler for repository updates.
// GitHub identifies the event type via the X-GitHub-Event header rather
// than a payload field, so the caller passes it in alongside the body.
export async function handleGitHubWebhook(event: string, payload: GitHubWebhook) {
  if (event === 'push') {
    // Only invalidate when files that affect the analysis changed
    const hasSignificantChanges = payload.commits.some(commit =>
      commit.modified.some(file =>
        file.endsWith('package.json') ||
        file.endsWith('requirements.txt') ||
        file === 'README.md'
      )
    )

    if (hasSignificantChanges) {
      await invalidateCaches(payload.repository.full_name)
      await queueReanalysis(payload.repository.full_name)
    }
  }
}

Migration Strategy

Phase 1: Application Caching (Week 1)

  • Implement Redis caching for analysis results
  • Basic TTL-based invalidation
  • Performance monitoring setup
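Phase 1's TTL behavior can be prototyped before Redis is wired in. A minimal sketch with an in-memory store (the Map stands in for Redis; in production this would map onto the client's `SET key value EX seconds`, and the `TtlCache` name is illustrative):

```typescript
// TTL cache: each entry records its expiry time; reads past the
// deadline behave like a miss, mirroring Redis SETEX semantics.
class TtlCache<T> {
  private store = new Map<string, { value: T; expiresAt: number }>()

  // Clock is injectable so TTL behavior is testable without real waits
  constructor(private now: () => number = Date.now) {}

  set(key: string, value: T, ttlSeconds: number): void {
    this.store.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 })
  }

  get(key: string): T | undefined {
    const entry = this.store.get(key)
    if (!entry) return undefined
    if (this.now() > entry.expiresAt) {
      this.store.delete(key) // lazy expiry on read
      return undefined
    }
    return entry.value
  }
}
```

Injecting `now` keeps the Tier 2 TTL values (6 hours for analysis, 1 hour for computed badges) verifiable in unit tests.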

Phase 2: CDN Integration (Week 2)

  • Configure Cloudflare caching rules
  • Implement webhook-based invalidation
  • Global performance testing
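One way to drive the Phase 2 Cloudflare rules is to have the origin emit Cache-Control headers the edge honors, rather than maintaining page rules by hand. A hedged sketch (the split between edge TTL via `s-maxage` and a short browser `max-age` is an assumed policy, and `badgeCacheHeaders` is a hypothetical helper):

```typescript
// Build cache headers for a badge response.
function badgeCacheHeaders(isStatic: boolean): Record<string, string> {
  // Tier 1 TTLs: 24h for static badges, 1h for regular badges
  const edgeTtl = isStatic ? 24 * 3600 : 3600
  return {
    // s-maxage governs shared caches (the CDN edge); max-age keeps the
    // browser window short so webhook-driven purges reach users quickly
    'Cache-Control': `public, max-age=60, s-maxage=${edgeTtl}`,
    // Keep compressed and uncompressed variants as separate cache entries
    'Vary': 'Accept-Encoding',
  }
}
```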

Phase 3: Optimization (Week 3)

  • Add cache warming for popular repositories
  • Implement stale-while-revalidate pattern
  • Advanced invalidation logic based on file changes
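The stale-while-revalidate pattern from Phase 3 can be sketched as: serve whatever is cached immediately, and if it is past its freshness window, kick off a background rebuild rather than making the caller wait. A minimal sketch (the `SwrCache` class and its field names are illustrative; the clock is injectable for testing):

```typescript
interface Entry<T> { value: T; freshUntil: number }

class SwrCache<T> {
  private store = new Map<string, Entry<T>>()
  // Keys with a background refresh already in flight
  private refreshing = new Set<string>()

  constructor(private now: () => number = Date.now) {}

  async get(
    key: string,
    ttlMs: number,
    rebuild: () => Promise<T>,
  ): Promise<T> {
    const entry = this.store.get(key)
    if (!entry) {
      // Cold miss: the first caller has to wait for a full build
      const value = await rebuild()
      this.store.set(key, { value, freshUntil: this.now() + ttlMs })
      return value
    }
    if (this.now() > entry.freshUntil && !this.refreshing.has(key)) {
      // Stale: refresh in the background, but serve the old value now
      this.refreshing.add(key)
      rebuild()
        .then(value => {
          this.store.set(key, { value, freshUntil: this.now() + ttlMs })
        })
        .finally(() => this.refreshing.delete(key))
    }
    return entry.value
  }
}
```

This is also a second line of defense against stampedes: stale entries absorb traffic while exactly one refresh runs per key.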

Conclusion

The multi-tier caching strategy is essential for meeting the <200ms performance requirement. While complex, the approach provides multiple fallback layers and enables global performance at scale. The implementation should prioritize cache stampede prevention and intelligent invalidation from day one.