
ADR-010: Regression Testing Strategy

Status

Implemented - Reviewed by LLM Council (2026-01-18), Implementation completed (2026-01-19)

Implementation Status

| Component | Status | Implementation |
|---|---|---|
| 3-Tier Testing Strategy | ✅ Implemented | `e2e/regression/regression.config.json` |
| T0 Smoke Tests in CI | ✅ Implemented | `.github/workflows/ci-build-test.yml` |
| Post-Deploy Regression | ✅ Implemented | `.github/workflows/ci-cd-uat.yml` |
| Flaky Test Threshold (15%) | ✅ Implemented | `e2e/regression/regression.config.json` |
| Risk-Based Selection | ✅ Implemented | `e2e/regression/risk-calculator.ts` |
| Flaky Test Quarantine | ✅ Implemented | `e2e/quarantine/`, `playwright.config.ts` |

Context

Habit Hub requires a comprehensive regression testing strategy to ensure that code changes don't introduce regressions in existing functionality. The application has multiple test layers (unit, integration, E2E) and deployment environments (development, UAT, production) that need coordinated testing approaches.

Current State Assessment

What Exists

| Component | Implementation | Location | Status |
|---|---|---|---|
| Unit Tests (Backend) | pytest with async support | `backend/tests/` | ✅ Implemented |
| Unit Tests (Frontend) | Vitest + Testing Library | `frontend/src/**/__tests__/` | ✅ Implemented |
| E2E Tests | Playwright multi-platform | `e2e/` | ✅ Implemented |
| Smart Regression Suite | Risk-based test selection | `e2e/regression/` | ⚠️ Partially Implemented |
| Flaky Test Detection | Auto-quarantine system | `e2e/quarantine/` | ✅ Implemented |
| Visual Regression | Perceptual diffing | `e2e/regression/` | ⚠️ Configured, not active |
| CI Integration | GitHub Actions | `.github/workflows/` | ⚠️ Partial |

Test Coverage Requirements (from CLAUDE.md)

| Layer | Lines | Functions | Branches | Statements |
|---|---|---|---|---|
| Frontend | 70% | 70% | 60% | 70% |
| Backend | 80% | 75% | 70% | 80% |
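
The frontend thresholds above could be enforced directly in the test runner. A minimal sketch of a `vitest.config.ts` excerpt, assuming Vitest's `coverage.thresholds` option (the repository's actual config is not shown in this ADR):

```typescript
// Hypothetical vitest.config.ts excerpt; the thresholds mirror the table
// above, but the rest of this config is an illustrative assumption.
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    coverage: {
      thresholds: {
        lines: 70,
        functions: 70,
        branches: 60,
        statements: 70,
      },
    },
  },
});
```

The backend targets could similarly be gated with pytest-cov's `--cov-fail-under=80` flag, which fails the run when total coverage drops below the given percentage.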

E2E Test Organization

```
e2e/
├── auth/           # Authentication flows
├── habits/         # Habit management
├── navigation/     # UI navigation
├── monitoring/     # Synthetic monitoring
├── regression/     # Smart regression suite
│   ├── regression.config.json
│   ├── suite-manager.ts
│   └── flaky-detector.ts
├── quarantine/     # Flaky test quarantine (non-blocking)
│   └── README.md   # Quarantine process documentation
├── fixtures/       # Reusable fixtures
├── pages/          # Page Object Models
└── reporters/      # Linear integration
```

Gaps Identified

  1. Smart Suite Selection Not Integrated: The regression suite manager exists but isn't used in CI/CD
  2. No Post-Deployment Regression Tests: UAT deployment only runs smoke tests (HTTP status checks)
  3. Flaky Test Detection Passive: ✅ RESOLVED (quarantine suite implemented in e2e/quarantine/ with a dedicated Playwright project)
  4. No Test Impact Analysis: Code changes don't trigger targeted test subsets
  5. Visual Regression Inactive: Infrastructure exists but not running
  6. No Performance Regression Testing: No baseline comparisons for response times
  7. Linear Integration Incomplete: Configured but not fully activated for UAT failures

Decision

1. Tiered Regression Testing Strategy

Implement a streamlined 3-tier regression testing approach (simplified from initial 5-tier proposal based on LLM Council feedback):

| Tier | Trigger | Suite | Duration | Blocking |
|---|---|---|---|---|
| T0 - Smoke | Every commit, PR open | Critical paths + risk-based selection | < 10 min | Yes |
| T1 - Integration | PR merge to develop, UAT deploy | Full regression + visual | < 30 min | Yes (CRITICAL only) |
| T2 - Comprehensive | Nightly (2 AM UTC) | Full + visual + perf + exploratory | < 90 min | No (report) |

Rationale for 3-Tier Approach:

  • Combines T0/T1 into single Smoke tier (commit + PR gates serve similar purpose)
  • Merges T2/T3 into Integration tier (full regression at merge and deploy)
  • Consolidates T3/T4 into Comprehensive nightly run
  • Reduces cognitive overhead while maintaining coverage
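
The trigger-to-tier routing above can be sketched as a simple dispatch; the event names here are assumptions for illustration, not the actual workflow trigger identifiers:

```typescript
// Illustrative mapping from CI trigger to regression tier.
// Event names are hypothetical; the real workflows live in .github/workflows/.
type Tier = 'T0-smoke' | 'T1-integration' | 'T2-comprehensive';

function tierFor(event: string): Tier {
  switch (event) {
    case 'push':
    case 'pull_request':
      return 'T0-smoke';         // every commit / PR open; blocking, < 10 min
    case 'merge_to_develop':
    case 'uat_deploy':
      return 'T1-integration';   // full regression + visual; < 30 min
    case 'schedule':
      return 'T2-comprehensive'; // nightly 2 AM UTC; non-blocking, < 90 min
    default:
      return 'T0-smoke';         // safe default: the fast blocking gate
  }
}
```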

2. Risk-Based Test Selection

Use existing regression.config.json to select tests based on:

```json
{
  "riskFactors": {
    "auth": 1.0,
    "payment": 1.0,
    "database": 0.9,
    "api": 0.8,
    "habits": 0.8,
    "ui": 0.85,
    "sync": 0.9,
    "static": 0.3
  },
  "criticalPaths": [
    "auth/login.spec.ts",
    "auth/registration.spec.ts",
    "habits/daily-habits.spec.ts",
    "habits/habit-completion.spec.ts",
    "sync/timezone-handling.spec.ts"
  ],
  "domainSpecific": {
    "timezone_testing": true,
    "temporal_boundaries": ["midnight", "week_start", "month_end"],
    "offline_sync_scenarios": true
  }
}
```

Note: The UI risk factor was raised from 0.6 to 0.85 per council feedback: for a touch-first habit tracker, UI regressions directly impact user experience and habit completion rates. A sync category (0.9) was added for the offline/timezone scenarios critical to the habit-tracking domain.
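
The selection rule can be sketched as follows. This is only an illustration: the real logic lives in `e2e/regression/risk-calculator.ts`, and the `selectTests` helper, the default risk of 0.5 for unknown categories, and the 0.8 cutoff are assumptions:

```typescript
// Hypothetical sketch of risk-based test selection driven by
// regression.config.json. Critical paths always run; other specs run
// when their category's risk factor meets the threshold.
interface RegressionConfig {
  riskFactors: Record<string, number>;
  criticalPaths: string[];
}

function selectTests(
  config: RegressionConfig,
  allSpecs: string[],
  threshold = 0.8,
): string[] {
  const selected = new Set<string>(config.criticalPaths);
  for (const spec of allSpecs) {
    const category = spec.split('/')[0]; // "auth/login.spec.ts" -> "auth"
    const risk = config.riskFactors[category] ?? 0.5; // assumed default
    if (risk >= threshold) selected.add(spec);
  }
  return [...selected];
}
```

With the factors above, a change touching only `static` assets (0.3) would skip those specs in T0, while anything in `auth` (1.0) or `sync` (0.9) always runs.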

3. Test Impact Analysis (Deferred)

Council Recommendation: Full test impact analysis mapping has been deferred due to high implementation effort vs. benefit ratio at current codebase scale.

Simplified Approach:

  • Use directory-based heuristics (changes in backend/app/api/v1/habits.py → run e2e/habits/*.spec.ts)
  • Rely on risk-based test selection from Tier 0 configuration
  • Revisit formal test impact mapping when codebase exceeds 50k LOC
```yaml
# Simplified directory-based mapping (future consideration)
heuristics:
  - pattern: "backend/app/api/v1/*.py"
    run: "e2e/{dirname}/*.spec.ts"
  - pattern: "backend/app/core/authorization.py"
    run: "e2e/auth/*.spec.ts"
```
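
The heuristic amounts to a small pattern-to-glob lookup; the `testsForChange` helper and its regex patterns below are illustrative assumptions, not existing project code:

```typescript
// Illustrative directory-based heuristic: map a changed backend file to the
// E2E spec glob that should run. Patterns here are assumptions for the sketch.
const heuristics: Array<{
  pattern: RegExp;
  run: (m: RegExpMatchArray) => string;
}> = [
  {
    pattern: /^backend\/app\/api\/v1\/(\w+)\.py$/,
    run: (m) => `e2e/${m[1]}/*.spec.ts`, // habits.py -> e2e/habits/*.spec.ts
  },
  {
    pattern: /^backend\/app\/core\/authorization\.py$/,
    run: () => 'e2e/auth/*.spec.ts',
  },
];

function testsForChange(changedFile: string): string | null {
  for (const { pattern, run } of heuristics) {
    const m = changedFile.match(pattern);
    if (m) return run(m);
  }
  return null; // no heuristic matched; fall back to risk-based selection
}
```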

4. Flaky Test Management

```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│ Test Execution  │────▶│ Flaky Detection  │────▶│  Quarantine DB  │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                                         │
                                                         ▼
                                                ┌──────────────────┐
                                                │   Linear Issue   │
                                                │  (auto-created)  │
                                                └──────────────────┘
```

Flakiness Criteria (tightened per council feedback):

  • Threshold: 15% failure rate over 10 runs (reduced from 30%)
  • Consecutive failures: 2 (trigger quarantine)
  • Auto-quarantine duration: 3 days (reduced from 7 days)
  • Review cadence: Daily triage, weekly deep-dive
  • Escalation: Tests quarantined > 2 cycles → Linear issue auto-created
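
The quarantine decision implied by these criteria can be sketched as below. The thresholds mirror the ADR, but the `shouldQuarantine` helper is an assumption; the real logic lives in `e2e/regression/flaky-detector.ts`:

```typescript
// Hedged sketch of the quarantine decision: quarantine on >= 15% failure
// rate over the rolling window, or on 2 consecutive failures.
interface TestHistory {
  results: boolean[]; // most recent run last; true = pass
}

function shouldQuarantine(history: TestHistory, window = 10): boolean {
  const recent = history.results.slice(-window);
  if (recent.length === 0) return false;
  const failures = recent.filter((passed) => !passed).length;
  const failureRate = failures / recent.length;
  const lastTwoFailed =
    recent.length >= 2 && recent.slice(-2).every((passed) => !passed);
  return failureRate >= 0.15 || lastTwoFailed;
}
```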

Quarantine Suite Implementation (Issue #42):

The quarantine suite is implemented as a dedicated Playwright project:

```typescript
// playwright.config.ts - quarantined project
{
  name: 'quarantined',
  testDir: './quarantine',
  use: { ...devices['Desktop Chrome'] },
  retries: 5,          // More retries for flaky tests
  timeout: 60 * 1000,  // Longer timeout
}
```

CI Integration: The quarantine suite runs as a non-blocking job in CI:

```yaml
# .github/workflows/ci-build-test.yml
quarantine-tests:
  name: Quarantine Suite (Non-Blocking)
  continue-on-error: true  # Never blocks merge
```

Quarantine Process:

  1. Move flaky test to e2e/quarantine/ directory
  2. Add @quarantined metadata with Linear issue reference
  3. Tests run in CI but don't block merges
  4. Fix and promote back to main suite when stable

See e2e/quarantine/README.md for detailed quarantine process documentation.

5. Post-Deployment Verification

After UAT deployment, run:

  1. Functional Smoke Tests (not just HTTP checks)
  2. API Contract Validation (OpenAPI schema compliance)
  3. Critical User Journeys (login → complete habit → view stats)
  4. Performance Baseline (P95 response time < 500ms)
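
The performance baseline check (item 4) reduces to a P95 comparison against the 500 ms budget. A minimal sketch, where the 500 ms budget comes from this ADR but the helpers themselves are assumptions:

```typescript
// Compute the 95th-percentile response time via the nearest-rank method
// and compare it against the post-deploy budget.
function p95(samplesMs: number[]): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const idx = Math.ceil(0.95 * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

function withinBaseline(samplesMs: number[], budgetMs = 500): boolean {
  return p95(samplesMs) < budgetMs;
}
```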

Implementation Plan

Phase 1: CI Integration (Week 1-2)

  1. Integrate smart suite selection into GitHub Actions
  2. Enable flaky test auto-quarantine
  3. Add test impact analysis for PRs

Phase 2: Post-Deployment Testing (Week 3-4)

  1. Expand UAT smoke tests to functional tests
  2. Add API contract validation
  3. Implement performance baseline checks

Phase 3: Advanced Features (Week 5-6)

  1. Enable visual regression testing
  2. Add nightly comprehensive suite
  3. Implement test result trending dashboard

Consequences

Positive

  • Faster feedback on code changes (targeted tests)
  • Reduced flaky test noise in CI
  • Better confidence in UAT deployments
  • Clear test ownership via Linear integration

Negative

  • Increased CI complexity
  • Requires maintenance of test-impact mappings
  • Quarantine system needs monitoring

Risks

  • Over-reliance on risk scores may miss edge cases
  • Test impact maps can become stale
  • Flaky detection may quarantine legitimate failures

Metrics

| Metric | Target | Current |
|---|---|---|
| T0 (Smoke) Suite Duration | < 10 min | ~5 min |
| T1 (Integration) Suite Duration | < 30 min | ~20 min |
| T2 (Comprehensive) Suite Duration | < 90 min | N/A |
| Flaky Test Rate | < 5% | Unknown |
| Test Coverage (Backend) | 80% | ~75% |
| Test Coverage (Frontend) | 70% | ~65% |
| Post-Deploy Test Pass Rate | > 99% | N/A |

LLM Council Review

Review Date: 2026-01-18
Consensus: 0.85 (Strong Agreement)

Key Changes Based on Council Feedback

  1. Tier Consolidation: Reduced from 5 tiers to 3 tiers

    • Original T0/T1 merged → New T0 (Smoke)
    • Original T2/T3 merged → New T1 (Integration)
    • Original T3/T4 merged → New T2 (Comprehensive)
    • Rationale: Reduces cognitive overhead, similar gates don't need separate tiers
  2. UI Risk Score Increased: 0.6 → 0.85

    • For touch-first habit tracker, UI regressions directly impact habit completion
    • UI bugs have high user-facing impact in this domain
  3. Flaky Test Thresholds Tightened:

    • Failure rate threshold: 30% → 15%
    • Quarantine duration: 7 days → 3 days
    • Added escalation path for persistent flaky tests
  4. Test Impact Analysis Deferred:

    • High implementation effort vs. benefit at current codebase scale
    • Using directory-based heuristics as simplified alternative
    • Revisit when codebase exceeds 50k LOC
  5. Domain-Specific Testing Added:

    • Timezone/temporal boundary testing (midnight, week_start, month_end)
    • Offline sync scenario testing
    • Critical for habit tracking domain where timing matters

Council Dissent (Minority Opinion)

  • One model suggested even stricter flaky test thresholds (10% vs 15%)
  • Deferred to majority consensus pending initial data collection

References