ADR-010: Regression Testing Strategy
Status
Implemented - Reviewed by LLM Council (2026-01-18), Implementation completed (2026-01-19)
Implementation Status
| Component | Status | Implementation |
|---|---|---|
| 3-Tier Testing Strategy | ✅ Implemented | e2e/regression/regression.config.json |
| T0 Smoke Tests in CI | ✅ Implemented | .github/workflows/ci-build-test.yml |
| Post-Deploy Regression | ✅ Implemented | .github/workflows/ci-cd-uat.yml |
| Flaky Test Threshold (15%) | ✅ Implemented | e2e/regression/regression.config.json |
| Risk-Based Selection | ✅ Implemented | e2e/regression/risk-calculator.ts |
| Flaky Test Quarantine | ✅ Implemented | e2e/quarantine/, playwright.config.ts |
Context
Habit Hub requires a comprehensive regression testing strategy to ensure that code changes don't introduce regressions in existing functionality. The application has multiple test layers (unit, integration, E2E) and deployment environments (development, UAT, production) that need coordinated testing approaches.
Current State Assessment
What Exists
| Component | Implementation | Location | Status |
|---|---|---|---|
| Unit Tests (Backend) | pytest with async support | backend/tests/ | ✅ Implemented |
| Unit Tests (Frontend) | Vitest + Testing Library | frontend/src/**/__tests__/ | ✅ Implemented |
| E2E Tests | Playwright multi-platform | e2e/ | ✅ Implemented |
| Smart Regression Suite | Risk-based test selection | e2e/regression/ | ⚠️ Partially Implemented |
| Flaky Test Detection | Auto-quarantine system | e2e/quarantine/ | ✅ Implemented |
| Visual Regression | Perceptual diffing | e2e/regression/ | ⚠️ Configured, not active |
| CI Integration | GitHub Actions | .github/workflows/ | ⚠️ Partial |
Test Coverage Requirements (from CLAUDE.md)
| Layer | Lines | Functions | Branches | Statements |
|---|---|---|---|---|
| Frontend | 70% | 70% | 60% | 70% |
| Backend | 80% | 75% | 70% | 80% |
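As an illustration, the frontend thresholds above can be enforced directly in the Vitest coverage configuration (a sketch only; the exact file location, coverage provider, and option layout depend on the repo's Vitest version and setup):

```typescript
// vitest.config.ts (sketch) - fails the test run when the frontend
// coverage targets above are not met. Option names follow Vitest >= 1.x
// (`coverage.thresholds`); verify against the project's installed version.
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    coverage: {
      thresholds: {
        lines: 70,
        functions: 70,
        branches: 60,
        statements: 70,
      },
    },
  },
});
```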
E2E Test Organization
```
e2e/
├── auth/          # Authentication flows
├── habits/        # Habit management
├── navigation/    # UI navigation
├── monitoring/    # Synthetic monitoring
├── regression/    # Smart regression suite
│   ├── regression.config.json
│   ├── suite-manager.ts
│   └── flaky-detector.ts
├── quarantine/    # Flaky test quarantine (non-blocking)
│   └── README.md  # Quarantine process documentation
├── fixtures/      # Reusable fixtures
├── pages/         # Page Object Models
└── reporters/     # Linear integration
```
Gaps Identified
- Smart Suite Selection Not Integrated: The regression suite manager exists but isn't used in CI/CD
- No Post-Deployment Regression Tests: UAT deployment only runs smoke tests (HTTP status checks)
- Flaky Test Detection Passive: ✅ RESOLVED - Quarantine suite implemented in e2e/quarantine/ with a dedicated Playwright project
- No Test Impact Analysis: Code changes don't trigger targeted test subsets
- Visual Regression Inactive: Infrastructure exists but not running
- No Performance Regression Testing: No baseline comparisons for response times
- Linear Integration Incomplete: Configured but not fully activated for UAT failures
Decision
1. Tiered Regression Testing Strategy
Implement a streamlined 3-tier regression testing approach (simplified from initial 5-tier proposal based on LLM Council feedback):
| Tier | Trigger | Suite | Duration | Blocking |
|---|---|---|---|---|
| T0 - Smoke | Every commit, PR open | Critical paths + risk-based selection | < 10 min | Yes |
| T1 - Integration | PR merge to develop, UAT deploy | Full regression + visual | < 30 min | Yes (CRITICAL only) |
| T2 - Comprehensive | Nightly (2 AM UTC) | Full + visual + perf + exploratory | < 90 min | No (report) |
Rationale for 3-Tier Approach:
- Combines T0/T1 into single Smoke tier (commit + PR gates serve similar purpose)
- Merges T2/T3 into Integration tier (full regression at merge and deploy)
- Consolidates T4 into the Comprehensive nightly run
- Reduces cognitive overhead while maintaining coverage
2. Risk-Based Test Selection
Use existing regression.config.json to select tests based on:
```json
{
  "riskFactors": {
    "auth": 1.0,
    "payment": 1.0,
    "database": 0.9,
    "api": 0.8,
    "habits": 0.8,
    "ui": 0.85,
    "sync": 0.9,
    "static": 0.3
  },
  "criticalPaths": [
    "auth/login.spec.ts",
    "auth/registration.spec.ts",
    "habits/daily-habits.spec.ts",
    "habits/habit-completion.spec.ts",
    "sync/timezone-handling.spec.ts"
  ],
  "domainSpecific": {
    "timezone_testing": true,
    "temporal_boundaries": ["midnight", "week_start", "month_end"],
    "offline_sync_scenarios": true
  }
}
```
Note: UI risk increased from 0.6 to 0.85 per council feedback - for a touch-first habit tracker, UI regressions directly impact user experience and habit completion rates. Added sync category (0.9) for offline/timezone scenarios critical to habit tracking domain.
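The selection logic itself lives in e2e/regression/risk-calculator.ts; as an illustration only, a minimal sketch of how such risk factors could gate smoke-tier selection might look like this (the `scoreFile` and `selectCategories` helpers and the 0.8 cutoff are hypothetical, not the actual implementation):

```typescript
// Illustrative risk scoring sketch - the real logic lives in
// e2e/regression/risk-calculator.ts. Weights mirror regression.config.json.
const riskFactors: Record<string, number> = {
  auth: 1.0, payment: 1.0, database: 0.9, api: 0.8,
  habits: 0.8, ui: 0.85, sync: 0.9, static: 0.3,
};

// Score a changed file by the highest-risk category its path mentions.
function scoreFile(path: string): number {
  let score = 0;
  for (const [category, weight] of Object.entries(riskFactors)) {
    if (path.includes(category)) score = Math.max(score, weight);
  }
  return score;
}

// Select test categories whose risk meets a (hypothetical) smoke threshold.
function selectCategories(changedFiles: string[], threshold = 0.8): string[] {
  const hit = new Set<string>();
  for (const file of changedFiles) {
    for (const [category, weight] of Object.entries(riskFactors)) {
      if (weight >= threshold && file.includes(category)) hit.add(category);
    }
  }
  return [...hit].sort();
}
```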
3. Test Impact Analysis (Deferred)
Council Recommendation: Full test impact analysis mapping has been deferred due to high implementation effort vs. benefit ratio at current codebase scale.
Simplified Approach:
- Use directory-based heuristics (changes in backend/app/api/v1/habits.py → run e2e/habits/*.spec.ts)
- Rely on risk-based test selection from Tier 0 configuration
- Revisit formal test impact mapping when codebase exceeds 50k LOC
```yaml
# Simplified directory-based mapping (future consideration)
heuristics:
  - pattern: "backend/app/api/v1/*.py"
    run: "e2e/{dirname}/*.spec.ts"
  - pattern: "backend/app/core/authorization.py"
    run: "e2e/auth/*.spec.ts"
```
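Those two heuristics can be sketched in TypeScript as follows (the `specsForChange` helper is hypothetical and not part of the current suite-manager.ts; it only illustrates the directory-name mapping described above):

```typescript
// Illustrative directory-based heuristic: map a changed backend API file to
// the E2E spec directory sharing its name (habits.py -> e2e/habits/), and
// fan core authorization changes out to the auth flows.
function specsForChange(changedPath: string): string | null {
  const apiMatch = changedPath.match(/^backend\/app\/api\/v1\/(\w+)\.py$/);
  if (apiMatch) return `e2e/${apiMatch[1]}/*.spec.ts`;

  if (changedPath === 'backend/app/core/authorization.py') {
    return 'e2e/auth/*.spec.ts';
  }

  // No mapping: fall back to risk-based selection.
  return null;
}
```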
4. Flaky Test Management
```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│ Test Execution  │────▶│  Flaky Detection │────▶│  Quarantine DB  │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                 │
                                 ▼
                        ┌──────────────────┐
                        │   Linear Issue   │
                        │  (auto-created)  │
                        └──────────────────┘
```
Flakiness Criteria (tightened per council feedback):
- Threshold: 15% failure rate over 10 runs (reduced from 30%)
- Consecutive failures: 2 (trigger quarantine)
- Auto-quarantine duration: 3 days (reduced from 7 days)
- Review cadence: Daily triage, weekly deep-dive
- Escalation: Tests quarantined > 2 cycles → Linear issue auto-created
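The two quarantine triggers above can be sketched as a single predicate (illustrative only; the actual detection logic lives in e2e/regression/flaky-detector.ts, and the `shouldQuarantine` helper here is a hypothetical simplification):

```typescript
// Illustrative flaky-test check implementing the criteria above:
// >= 15% failure rate over a full 10-run window, OR 2 consecutive failures.
// history is ordered oldest-to-newest; true = pass, false = fail.
function shouldQuarantine(history: boolean[], windowSize = 10): boolean {
  const recent = history.slice(-windowSize);
  const failures = recent.filter((passed) => !passed).length;

  // Trigger 1: failure rate threshold over a complete window.
  if (recent.length === windowSize && failures / recent.length >= 0.15) {
    return true;
  }

  // Trigger 2: two consecutive failures.
  const lastTwo = history.slice(-2);
  return lastTwo.length === 2 && lastTwo.every((passed) => !passed);
}
```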
Quarantine Suite Implementation (Issue #42):
The quarantine suite is implemented as a dedicated Playwright project:
```typescript
// playwright.config.ts - quarantined project
{
  name: 'quarantined',
  testDir: './quarantine',
  use: { ...devices['Desktop Chrome'] },
  retries: 5,          // More retries for flaky tests
  timeout: 60 * 1000,  // Longer timeout
}
```
CI Integration: The quarantine suite runs as a non-blocking job in CI:
```yaml
# .github/workflows/ci-build-test.yml
quarantine-tests:
  name: Quarantine Suite (Non-Blocking)
  continue-on-error: true  # Never blocks merge
```
Quarantine Process:
- Move flaky test to e2e/quarantine/ directory
- Add @quarantined metadata with Linear issue reference
- Tests run in CI but don't block merges
- Fix and promote back to main suite when stable
See e2e/quarantine/README.md for detailed quarantine process documentation.
5. Post-Deployment Verification
After UAT deployment, run:
- Functional Smoke Tests (not just HTTP checks)
- API Contract Validation (OpenAPI schema compliance)
- Critical User Journeys (login → complete habit → view stats)
- Performance Baseline (P95 response time < 500ms)
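The performance gate above can be sketched as a small P95 check (illustrative; the `p95` and `passesPerfBaseline` helpers are hypothetical, and sampling/reporting plumbing is omitted - only the 500 ms threshold comes from this ADR):

```typescript
// Illustrative post-deploy performance gate: compute the P95 of sampled
// response times (nearest-rank method) and compare to the 500 ms baseline.
function p95(samplesMs: number[]): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const idx = Math.ceil(0.95 * sorted.length) - 1;
  return sorted[idx];
}

function passesPerfBaseline(samplesMs: number[], baselineMs = 500): boolean {
  return p95(samplesMs) < baselineMs;
}
```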
Implementation Plan
Phase 1: CI Integration (Week 1-2)
- Integrate smart suite selection into GitHub Actions
- Enable flaky test auto-quarantine
- Add test impact analysis for PRs
Phase 2: Post-Deployment Testing (Week 3-4)
- Expand UAT smoke tests to functional tests
- Add API contract validation
- Implement performance baseline checks
Phase 3: Advanced Features (Week 5-6)
- Enable visual regression testing
- Add nightly comprehensive suite
- Implement test result trending dashboard
Consequences
Positive
- Faster feedback on code changes (targeted tests)
- Reduced flaky test noise in CI
- Better confidence in UAT deployments
- Clear test ownership via Linear integration
Negative
- Increased CI complexity
- Requires maintenance of test-impact mappings
- Quarantine system needs monitoring
Risks
- Over-reliance on risk scores may miss edge cases
- Test impact maps can become stale
- Flaky detection may quarantine legitimate failures
Metrics
| Metric | Target | Current |
|---|---|---|
| T0 (Smoke) Suite Duration | < 10 min | ~5 min |
| T1 (Integration) Suite Duration | < 30 min | ~20 min |
| T2 (Comprehensive) Suite Duration | < 90 min | N/A |
| Flaky Test Rate | < 5% | Unknown |
| Test Coverage (Backend) | 80% | ~75% |
| Test Coverage (Frontend) | 70% | ~65% |
| Post-Deploy Test Pass Rate | > 99% | N/A |
LLM Council Review
Review Date: 2026-01-18
Consensus: 0.85 (Strong Agreement)
Key Changes Based on Council Feedback
- Tier Consolidation: Reduced from 5 tiers to 3 tiers
- Original T0/T1 merged → New T0 (Smoke)
- Original T2/T3 merged → New T1 (Integration)
- Original T4 → New T2 (Comprehensive)
- Rationale: Reduces cognitive overhead, similar gates don't need separate tiers
- UI Risk Score Increased: 0.6 → 0.85
- For touch-first habit tracker, UI regressions directly impact habit completion
- UI bugs have high user-facing impact in this domain
- Flaky Test Thresholds Tightened:
- Failure rate threshold: 30% → 15%
- Quarantine duration: 7 days → 3 days
- Added escalation path for persistent flaky tests
- Test Impact Analysis Deferred:
- High implementation effort vs. benefit at current codebase scale
- Using directory-based heuristics as simplified alternative
- Revisit when codebase exceeds 50k LOC
- Domain-Specific Testing Added:
- Timezone/temporal boundary testing (midnight, week_start, month_end)
- Offline sync scenario testing
- Critical for habit tracking domain where timing matters
Council Dissent (Minority Opinion)
- One model suggested even stricter flaky test thresholds (10% vs 15%)
- Deferred to majority consensus pending initial data collection