ADR-010: Regression Testing Strategy
Status
Implemented - Reviewed by LLM Council (2026-01-18), Implementation completed (2026-01-19)
Implementation Status
| Component | Status | Implementation |
|---|---|---|
| 3-Tier Testing Strategy | ✅ Implemented | e2e/regression/regression.config.json |
| T0 Smoke Tests in CI | ✅ Implemented | .github/workflows/ci-build-test.yml |
| Post-Deploy Regression | ✅ Implemented | .github/workflows/ci-cd-uat.yml |
| Flaky Test Threshold (15%) | ✅ Implemented | e2e/regression/regression.config.json |
| Risk-Based Selection | ✅ Implemented | e2e/regression/risk-calculator.ts |
| Flaky Test Quarantine | ✅ Implemented | e2e/quarantine/, playwright.config.ts |
Context
Habit Hub requires a comprehensive regression testing strategy to ensure that code changes don't introduce regressions in existing functionality. The application has multiple test layers (unit, integration, E2E) and deployment environments (development, UAT, production) that need coordinated testing approaches.
Current State Assessment
What Exists
| Component | Implementation | Location | Status |
|---|---|---|---|
| Unit Tests (Backend) | pytest with async support | backend/tests/ | ✅ Implemented |
| Unit Tests (Frontend) | Vitest + Testing Library | frontend/src/**/__tests__/ | ✅ Implemented |
| E2E Tests | Playwright multi-platform | e2e/ | ✅ Implemented |
| Smart Regression Suite | Risk-based test selection | e2e/regression/ | ⚠️ Partially Implemented |
| Flaky Test Detection | Auto-quarantine system | e2e/quarantine/ | ✅ Implemented |
| Visual Regression | Perceptual diffing | e2e/regression/ | ⚠️ Configured, not active |
| CI Integration | GitHub Actions | .github/workflows/ | ⚠️ Partial |
Test Coverage Requirements (from CLAUDE.md)
| Layer | Lines | Functions | Branches | Statements |
|---|---|---|---|---|
| Frontend | 70% | 70% | 60% | 70% |
| Backend | 80% | 75% | 70% | 80% |
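As an illustration, the frontend thresholds above can be enforced directly in the Vitest coverage configuration (a sketch only; the exact file location, coverage provider, and option layout depend on the repo's Vitest version and setup):

```typescript
// vitest.config.ts (sketch) - fails the test run when the frontend
// coverage targets above are not met. Option names follow Vitest >= 1.x
// (`coverage.thresholds`); verify against the project's installed version.
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    coverage: {
      thresholds: {
        lines: 70,
        functions: 70,
        branches: 60,
        statements: 70,
      },
    },
  },
});
```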
E2E Test Organization
```
e2e/
├── auth/          # Authentication flows
├── habits/        # Habit management
├── navigation/    # UI navigation
├── monitoring/    # Synthetic monitoring
├── regression/    # Smart regression suite
│   ├── regression.config.json
│   ├── suite-manager.ts
│   └── flaky-detector.ts
├── quarantine/    # Flaky test quarantine (non-blocking)
│   └── README.md  # Quarantine process documentation
├── fixtures/      # Reusable fixtures
├── pages/         # Page Object Models
└── reporters/     # Linear integration
```
Gaps Identified
- Smart Suite Selection Not Integrated: The regression suite manager exists but isn't used in CI/CD
- No Post-Deployment Regression Tests: UAT deployment only runs smoke tests (HTTP status checks)
- Flaky Test Detection Passive: ✅ RESOLVED - Quarantine suite implemented in e2e/quarantine/ with a dedicated Playwright project
- No Test Impact Analysis: Code changes don't trigger targeted test subsets
- Visual Regression Inactive: Infrastructure exists but not running
- No Performance Regression Testing: No baseline comparisons for response times
- Linear Integration Incomplete: Configured but not fully activated for UAT failures
Decision
1. Tiered Regression Testing Strategy
Implement a streamlined 3-tier regression testing approach (simplified from initial 5-tier proposal based on LLM Council feedback):
| Tier | Trigger | Suite | Duration | Blocking |
|---|---|---|---|---|
| T0 - Smoke | Every commit, PR open | Critical paths + risk-based selection | < 10 min | Yes |
| T1 - Integration | PR merge to develop, UAT deploy | Full regression + visual | < 30 min | Yes (CRITICAL only) |
| T2 - Comprehensive | Nightly (2 AM UTC) | Full + visual + perf + exploratory | < 90 min | No (report) |
Rationale for 3-Tier Approach:
- Combines T0/T1 into single Smoke tier (commit + PR gates serve similar purpose)
- Merges T2/T3 into Integration tier (full regression at merge and deploy)
- Consolidates T4 into the Comprehensive nightly run
- Reduces cognitive overhead while maintaining coverage
2. Risk-Based Test Selection
Use existing regression.config.json to select tests based on:
```json
{
  "riskFactors": {
    "auth": 1.0,
    "payment": 1.0,
    "database": 0.9,
    "api": 0.8,
    "habits": 0.8,
    "ui": 0.85,
    "sync": 0.9,
    "static": 0.3
  },
  "criticalPaths": [
    "auth/login.spec.ts",
    "auth/registration.spec.ts",
    "habits/daily-habits.spec.ts",
    "habits/habit-completion.spec.ts",
    "sync/timezone-handling.spec.ts"
  ],
  "domainSpecific": {
    "timezone_testing": true,
    "temporal_boundaries": ["midnight", "week_start", "month_end"],
    "offline_sync_scenarios": true
  }
}
```
Note: UI risk increased from 0.6 to 0.85 per council feedback - for a touch-first habit tracker, UI regressions directly impact user experience and habit completion rates. Added sync category (0.9) for offline/timezone scenarios critical to habit tracking domain.
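The selection logic itself lives in e2e/regression/risk-calculator.ts; as an illustration only, a minimal sketch of how such risk factors could gate smoke-tier selection might look like this (the `scoreFile` and `selectCategories` helpers and the 0.8 cutoff are hypothetical, not the actual implementation):

```typescript
// Illustrative risk scoring sketch - the real logic lives in
// e2e/regression/risk-calculator.ts. Weights mirror regression.config.json.
const riskFactors: Record<string, number> = {
  auth: 1.0, payment: 1.0, database: 0.9, api: 0.8,
  habits: 0.8, ui: 0.85, sync: 0.9, static: 0.3,
};

// Score a changed file by the highest-risk category its path mentions.
function scoreFile(path: string): number {
  let score = 0;
  for (const [category, weight] of Object.entries(riskFactors)) {
    if (path.includes(category)) score = Math.max(score, weight);
  }
  return score;
}

// Select test categories whose risk meets a (hypothetical) smoke threshold.
function selectCategories(changedFiles: string[], threshold = 0.8): string[] {
  const hit = new Set<string>();
  for (const file of changedFiles) {
    for (const [category, weight] of Object.entries(riskFactors)) {
      if (weight >= threshold && file.includes(category)) hit.add(category);
    }
  }
  return [...hit].sort();
}
```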
3. Test Impact Analysis (Deferred)
Council Recommendation: Full test impact analysis mapping has been deferred due to high implementation effort vs. benefit ratio at current codebase scale.
Simplified Approach:
- Use directory-based heuristics (changes in backend/app/api/v1/habits.py → run e2e/habits/*.spec.ts)
- Rely on risk-based test selection from Tier 0 configuration
- Revisit formal test impact mapping when codebase exceeds 50k LOC
```yaml
# Simplified directory-based mapping (future consideration)
heuristics:
  - pattern: "backend/app/api/v1/*.py"
    run: "e2e/{dirname}/*.spec.ts"
  - pattern: "backend/app/core/authorization.py"
    run: "e2e/auth/*.spec.ts"
```
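Those two heuristics can be sketched in TypeScript as follows (the `specsForChange` helper is hypothetical and not part of the current suite-manager.ts; it only illustrates the directory-name mapping described above):

```typescript
// Illustrative directory-based heuristic: map a changed backend API file to
// the E2E spec directory sharing its name (habits.py -> e2e/habits/), and
// fan core authorization changes out to the auth flows.
function specsForChange(changedPath: string): string | null {
  const apiMatch = changedPath.match(/^backend\/app\/api\/v1\/(\w+)\.py$/);
  if (apiMatch) return `e2e/${apiMatch[1]}/*.spec.ts`;

  if (changedPath === 'backend/app/core/authorization.py') {
    return 'e2e/auth/*.spec.ts';
  }

  // No mapping: fall back to risk-based selection.
  return null;
}
```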
4. Flaky Test Management
```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│ Test Execution  │────▶│  Flaky Detection │────▶│  Quarantine DB  │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                 │
                                 ▼
                        ┌──────────────────┐
                        │   Linear Issue   │
                        │  (auto-created)  │
                        └──────────────────┘
```
Flakiness Criteria (tightened per council feedback):
- Threshold: 15% failure rate over 10 runs (reduced from 30%)
- Consecutive failures: 2 (trigger quarantine)
- Auto-quarantine duration: 3 days (reduced from 7 days)
- Review cadence: Daily triage, weekly deep-dive
- Escalation: Tests quarantined > 2 cycles → Linear issue auto-created
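The two quarantine triggers above can be sketched as a single predicate (illustrative only; the actual detection logic lives in e2e/regression/flaky-detector.ts, and the `shouldQuarantine` helper here is a hypothetical simplification):

```typescript
// Illustrative flaky-test check implementing the criteria above:
// >= 15% failure rate over a full 10-run window, OR 2 consecutive failures.
// history is ordered oldest-to-newest; true = pass, false = fail.
function shouldQuarantine(history: boolean[], windowSize = 10): boolean {
  const recent = history.slice(-windowSize);
  const failures = recent.filter((passed) => !passed).length;

  // Trigger 1: failure rate threshold over a complete window.
  if (recent.length === windowSize && failures / recent.length >= 0.15) {
    return true;
  }

  // Trigger 2: two consecutive failures.
  const lastTwo = history.slice(-2);
  return lastTwo.length === 2 && lastTwo.every((passed) => !passed);
}
```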
Quarantine Suite Implementation (Issue #42):
The quarantine suite is implemented as a dedicated Playwright project:
```typescript
// playwright.config.ts - quarantined project
{
  name: 'quarantined',
  testDir: './quarantine',
  use: { ...devices['Desktop Chrome'] },
  retries: 5,          // More retries for flaky tests
  timeout: 60 * 1000,  // Longer timeout
}
```
CI Integration: The quarantine suite runs as a non-blocking job in CI:
```yaml
# .github/workflows/ci-build-test.yml
quarantine-tests:
  name: Quarantine Suite (Non-Blocking)
  continue-on-error: true  # Never blocks merge
```
Quarantine Process:
- Move flaky test to e2e/quarantine/ directory
- Add @quarantined metadata with Linear issue reference
- Tests run in CI but don't block merges
- Fix and promote back to main suite when stable
See e2e/quarantine/README.md for detailed quarantine process documentation.
5. Post-Deployment Verification
After UAT deployment, run:
- Functional Smoke Tests (not just HTTP checks)
- API Contract Validation (OpenAPI schema compliance)
- Critical User Journeys (login → complete habit → view stats)
- Performance Baseline (P95 response time < 500ms)
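The performance gate above can be sketched as a small P95 check (illustrative; the `p95` and `passesPerfBaseline` helpers are hypothetical, and sampling/reporting plumbing is omitted - only the 500 ms threshold comes from this ADR):

```typescript
// Illustrative post-deploy performance gate: compute the P95 of sampled
// response times (nearest-rank method) and compare to the 500 ms baseline.
function p95(samplesMs: number[]): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const idx = Math.ceil(0.95 * sorted.length) - 1;
  return sorted[idx];
}

function passesPerfBaseline(samplesMs: number[], baselineMs = 500): boolean {
  return p95(samplesMs) < baselineMs;
}
```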
Implementation Plan
Phase 1: CI Integration (Week 1-2)
- Integrate smart suite selection into GitHub Actions
- Enable flaky test auto-quarantine
- Add test impact analysis for PRs
Phase 2: Post-Deployment Testing (Week 3-4)
- Expand UAT smoke tests to functional tests
- Add API contract validation
- Implement performance baseline checks
Phase 3: Advanced Features (Week 5-6)
- Enable visual regression testing
- Add nightly comprehensive suite
- Implement test result trending dashboard
Consequences
Positive
- Faster feedback on code changes (targeted tests)
- Reduced flaky test noise in CI
- Better confidence in UAT deployments
- Clear test ownership via Linear integration
Negative
- Increased CI complexity
- Requires maintenance of test-impact mappings
- Quarantine system needs monitoring
Risks
- Over-reliance on risk scores may miss edge cases
- Test impact maps can become stale
- Flaky detection may quarantine legitimate failures
Metrics
| Metric | Target | Current |
|---|---|---|
| T0 (Smoke) Suite Duration | < 10 min | ~5 min |
| T1 (Integration) Suite Duration | < 30 min | ~20 min |
| T2 (Comprehensive) Suite Duration | < 90 min | N/A |
| Flaky Test Rate | < 5% | Unknown |
| Test Coverage (Backend) | 80% | ~75% |
| Test Coverage (Frontend) | 70% | ~65% |
| Post-Deploy Test Pass Rate | > 99% | N/A |
LLM Council Review
Review Date: 2026-01-18
Consensus: 0.85 (Strong Agreement)
Key Changes Based on Council Feedback
- Tier Consolidation: Reduced from 5 tiers to 3 tiers
- Original T0/T1 merged → New T0 (Smoke)
- Original T2/T3 merged → New T1 (Integration)
- Original T4 → New T2 (Comprehensive)
- Rationale: Reduces cognitive overhead, similar gates don't need separate tiers
- UI Risk Score Increased: 0.6 → 0.85
- For touch-first habit tracker, UI regressions directly impact habit completion
- UI bugs have high user-facing impact in this domain
- Flaky Test Thresholds Tightened:
- Failure rate threshold: 30% → 15%
- Quarantine duration: 7 days → 3 days
- Added escalation path for persistent flaky tests
- Test Impact Analysis Deferred:
- High implementation effort vs. benefit at current codebase scale
- Using directory-based heuristics as simplified alternative
- Revisit when codebase exceeds 50k LOC
- Domain-Specific Testing Added:
- Timezone/temporal boundary testing (midnight, week_start, month_end)
- Offline sync scenario testing
- Critical for habit tracking domain where timing matters
Council Dissent (Minority Opinion)
- One model suggested even stricter flaky test thresholds (10% vs 15%)
- Deferred to majority consensus pending initial data collection