# ADR-011: DevOps Automation Pipeline

## Status

Implemented - Reviewed by LLM Council (2026-01-18); implementation completed (2026-01-19).

## Implementation Status
| Component | Status | Implementation |
|---|---|---|
| G1 Commit Gate | ✅ Implemented | .github/workflows/ci-build-test.yml |
| G2-G4 Parallelization | ✅ Implemented | .github/workflows/ci-build-test.yml |
| G5 Pre-Deploy Validation | ✅ Implemented | .github/workflows/ci-cd-uat.yml |
| G6 Post-Deploy Verification | ✅ Implemented | .github/workflows/ci-cd-uat.yml |
| Automated Rollback | ✅ Implemented | .github/workflows/ci-cd-uat.yml |
| Drift Detection | ✅ Implemented | .github/workflows/drift-detection.yml |
| Pre-commit Hooks | ✅ Implemented | .pre-commit-config.yaml |
## Context
Habit Hub requires a robust DevOps automation pipeline that supports continuous integration, continuous deployment, and operational excellence across multiple environments (development, UAT, production). The pipeline must balance speed of delivery with quality gates and security requirements.
### Current State Assessment

#### What Exists
| Component | Implementation | Location | Status |
|---|---|---|---|
| CI Build/Test | GitHub Actions | .github/workflows/ci-build-test.yml | ✅ Implemented |
| UAT Deployment | GitHub Actions + ArgoCD | .github/workflows/ci-cd-uat.yml | ✅ Implemented |
| Production Deployment | GitHub Actions + ArgoCD | .github/workflows/ci-cd-production.yml | ✅ Implemented |
| E2E Testing | Multi-platform Playwright | .github/workflows/e2e-multiplatform.yml | ✅ Implemented |
| Security Audit | Scheduled workflow | .github/workflows/security-audit.yml | ⚠️ Partial |
| Infrastructure as Code | Terraform | terraform/ | ⚠️ Partial (VPC complete, EKS partial) |
| GitOps | ArgoCD | k8s/argocd-*.yaml | ✅ Implemented |
| Helm Charts | Multi-env values | helm/habit-hub/ | ✅ Implemented |
| Secrets Management | External Secrets + AWS SM | k8s/external-secrets/ | ✅ Implemented |
### Current CI/CD Flow
```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Commit    │───▶│    Test     │───▶│    Build    │───▶│   Deploy    │
│  (develop)  │    │  (pytest,   │    │  (Docker)   │    │  (ArgoCD)   │
└─────────────┘    │   vitest)   │    └─────────────┘    └─────────────┘
                   └─────────────┘           │                  │
                                             ▼                  ▼
                                      ┌─────────────┐    ┌─────────────┐
                                      │    GHCR     │    │     UAT     │
                                      │  (images)   │    │  Namespace  │
                                      └─────────────┘    └─────────────┘
```
### GitHub Actions Workflows Inventory
| Workflow | Trigger | Purpose | Duration |
|---|---|---|---|
| ci-build-test.yml | PR, push to main/develop | Lint, test, type-check | ~8 min |
| ci-cd-uat.yml | Push to develop | Build, deploy to UAT | ~12 min |
| ci-cd-production.yml | Push to main | Build, scan, deploy to prod | ~18 min |
| e2e-multiplatform.yml | Manual, PR | Cross-browser E2E tests | ~25 min |
| security-audit.yml | Weekly schedule | SAST, SCA, secrets scan | ~15 min |
### Gaps Identified
- No Automated Rollback: Failed deployments require manual intervention
- No Canary/Blue-Green: All-or-nothing deployments to UAT
- Security Scanning Not in UAT Pipeline: Only in production and scheduled
- No Pre-Deployment Validation: Missing environment health checks
- No Post-Deployment Verification: Only HTTP smoke tests
- Limited Observability: No deployment metrics/tracing
- No Environment Drift Detection: Configuration can diverge
- Manual Secrets Rotation: No automated rotation verification
- Incomplete IaC: EKS module not fully implemented
## Decision

### 1. Enhanced Pipeline Architecture
```
┌────────────────────────────────────────────────────────────────────────────┐
│                              PIPELINE STAGES                               │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                            │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐      │
│   │ Commit  │──▶│  Build  │──▶│  Test   │──▶│ Security│──▶│ Publish │      │
│   │  Gate   │   │  Stage  │   │  Stage  │   │  Stage  │   │  Stage  │      │
│   └─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘      │
│        │             │             │             │             │           │
│        ▼             ▼             ▼             ▼             ▼           │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐      │
│   │  Lint   │   │ Docker  │   │  Unit   │   │  SAST   │   │  GHCR   │      │
│   │ Commit  │   │  Build  │   │ Integr  │   │   SCA   │   │  Push   │      │
│   │   Msg   │   │         │   │   E2E   │   │ Secrets │   │         │      │
│   └─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘      │
│                                                                            │
├────────────────────────────────────────────────────────────────────────────┤
│                             DEPLOYMENT STAGES                              │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                            │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐      │
│   │  Pre-   │──▶│ Deploy  │──▶│ Verify  │──▶│ Promote │──▶│ Monitor │      │
│   │  Check  │   │  (UAT)  │   │  Stage  │   │  Gate   │   │  Stage  │      │
│   └─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘      │
│        │             │             │             │             │           │
│        ▼             ▼             ▼             ▼             ▼           │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐      │
│   │ Health  │   │ ArgoCD  │   │  Smoke  │   │ Manual  │   │ Alerts  │      │
│   │ Secrets │   │  Sync   │   │ Regress │   │ Approval│   │ Metrics │      │
│   │  Deps   │   │ Rolling │   │  Perf   │   │         │   │         │      │
│   └─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘      │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘
```
### 2. Quality Gates
| Gate | Stage | Criteria | Blocking | Parallel |
|---|---|---|---|---|
| G1 | Commit | Lint pass, commit message format | Yes | - |
| G2 | Build | Docker build success, no critical vulns | Yes | ✅ G2-G4 parallel |
| G3 | Test | Unit 80%, Integration pass, E2E critical pass | Yes | ✅ G2-G4 parallel |
| G4 | Security | No HIGH/CRITICAL SAST, SCA clean | Yes | ✅ G2-G4 parallel |
| G5 | Deploy | Pre-check pass, ArgoCD sync healthy | Yes | - |
| G6 | Verify | Smoke pass, regression > 95%, P95 < 500ms | Yes (UAT), Alert (Prod) | - |
G6 Telemetry Configuration (Issue #38):
The G6 post-deployment verification gate uses the following telemetry parameters:
| Parameter | Value | Rationale |
|---|---|---|
| Time Window | 3 minutes | Balance between fast feedback and stability |
| Fail-Closed | Yes | If telemetry unavailable, fail the gate |
| Data Source | Direct curl (fallback) | Prometheus/Grafana primary, curl backup |
| Retry Count | 3 retries | Handle transient network issues |
| Retry Delay | 10 seconds | Allow services to stabilize |
```yaml
# .github/workflows/ci-cd-uat.yml - G6 telemetry configuration
env:
  TELEMETRY_TIME_WINDOW_MINUTES: 3
  TELEMETRY_FAIL_CLOSED: true
  TELEMETRY_RETRY_COUNT: 3
  TELEMETRY_RETRY_DELAY_SECONDS: 10
```
Fail-Closed Behavior: When telemetry systems are unavailable, the gate fails rather than passing blindly. This prevents deploying changes that can't be verified.
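The fail-closed retry loop can be sketched in shell. This is a minimal illustration, not the shipped workflow step; `query_error_rate` is a hypothetical stand-in for the real Prometheus query or curl fallback:

```shell
#!/usr/bin/env sh
# Fail-closed telemetry check: retry a bounded number of times; if the
# telemetry source never answers, fail the gate rather than pass blindly.
# query_error_rate is a hypothetical stand-in for the Prometheus/curl call.
TELEMETRY_RETRY_COUNT=3
TELEMETRY_RETRY_DELAY_SECONDS=10

check_telemetry() {
  attempt=1
  while [ "$attempt" -le "$TELEMETRY_RETRY_COUNT" ]; do
    if rate=$(query_error_rate); then
      echo "error rate: $rate"
      return 0
    fi
    echo "telemetry unavailable (attempt $attempt), retrying..." >&2
    sleep "$TELEMETRY_RETRY_DELAY_SECONDS"
    attempt=$((attempt + 1))
  done
  echo "telemetry unreachable after $TELEMETRY_RETRY_COUNT attempts: failing closed" >&2
  return 1
}
```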
Note: G7 (manual approval) was removed for UAT per council feedback, since UAT should support continuous deployment. Manual approval remains for production only and is handled outside this ADR's scope.
Gate Parallelization (council recommendation):
- G2, G3, G4 run in parallel after G1 passes
- Reduces pipeline time by ~40% (from ~25 min to ~15 min)
- All three must pass before proceeding to G5
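In GitHub Actions terms, the parallel gates map onto job-level `needs:` dependencies. The following is an illustrative sketch only; the job names and steps are assumptions and may not match those in ci-build-test.yml:

```yaml
# Illustrative job graph for G1-G5; actual job names/steps may differ.
jobs:
  commit-gate:            # G1: lint + commit message format
    runs-on: ubuntu-latest
    steps:
      - run: echo "lint + commitlint"
  build:                  # G2: runs in parallel with test and security
    needs: commit-gate
    runs-on: ubuntu-latest
    steps:
      - run: echo "docker build"
  test:                   # G3
    needs: commit-gate
    runs-on: ubuntu-latest
    steps:
      - run: echo "unit + integration + critical e2e"
  security:               # G4
    needs: commit-gate
    runs-on: ubuntu-latest
    steps:
      - run: echo "sast + sca + secrets scan"
  deploy-uat:             # G5: gated on all three parallel jobs
    needs: [build, test, security]
    runs-on: ubuntu-latest
    steps:
      - run: echo "pre-checks + argocd sync"
```

Because `build`, `test`, and `security` each depend only on `commit-gate`, the runner schedules them concurrently, and `deploy-uat` starts only when all three succeed.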
### 3. Automated Rollback Strategy
```yaml
# Rollback triggers (thresholds tightened per council feedback)
triggers:
  - condition: smoke_test_failure
    action: immediate_rollback
    notification: slack_critical
  - condition: error_rate > 2%      # tightened from 5% per council
    window: 3_minutes               # shorter window for faster detection
    action: automatic_rollback
    notification: slack_alert
  - condition: p95_latency > 800ms  # tightened from 2000ms per council
    window: 5_minutes               # reduced window
    action: pause_deployment
    notification: slack_warning
  - condition: database_connection_failures > 0
    window: 1_minute
    action: immediate_rollback
    notification: slack_critical
```
Threshold Rationale (council feedback):
- A 2% error rate is aggressive but appropriate for UAT, where traffic is controlled
- An 800ms P95 aligns with user-experience expectations for a habit-tracking app
- Database connection failures are always critical and trigger an immediate rollback
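The trigger evaluation reduces to a small decision function. A sketch in shell, with metric values passed as arguments; in the real pipeline these would come from the telemetry source, and the function name is illustrative:

```shell
#!/usr/bin/env sh
# Decide the rollback action from observed metrics (illustrative helper).
# Usage: decide_action <error_rate_pct> <p95_ms> <db_conn_failures>
# Thresholds mirror the trigger config: 2% errors, 800ms P95, any DB failure.
decide_action() {
  error_rate=$1; p95_ms=$2; db_failures=$3
  if [ "$db_failures" -gt 0 ]; then
    echo "immediate_rollback"            # DB failures are always critical
  elif awk "BEGIN { exit !($error_rate > 2) }"; then
    echo "automatic_rollback"            # error rate above 2%
  elif [ "$p95_ms" -gt 800 ]; then
    echo "pause_deployment"              # latency degraded but not fatal
  else
    echo "proceed"
  fi
}
```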
### 4. Deployment Strategy for UAT

Council Decision: canary deployment is removed for the UAT environment.
Rationale:
- UAT has insufficient traffic volume for meaningful canary analysis
- Canary weights (10%, 50%, 100%) require statistically significant traffic to detect anomalies
- False positives from low traffic would cause unnecessary rollbacks
- Adds ~15 min to deployment time without proportional benefit
Recommended Approach for UAT:
```yaml
# Rolling deployment (not canary)
deployment:
  strategy: RollingUpdate
  maxUnavailable: 0
  maxSurge: 1
  progressDeadlineSeconds: 300

# Post-deployment validation (replaces canary analysis)
validation:
  smoke_tests: true
  health_check_retries: 3
  health_check_interval: 10s
```
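On the Kubernetes side, those settings correspond to standard fields on the Deployment object. A sketch of the relevant excerpt; the resource name and replica count are assumptions about the Helm chart's output, not verified values:

```yaml
# Illustrative Deployment excerpt; name, namespace, and replicas are assumed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: habit-hub-api
  namespace: habit-hub-uat
spec:
  replicas: 2
  progressDeadlineSeconds: 300   # mark the rollout failed if stuck for 5 min
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0          # never drop below full capacity during rollout
      maxSurge: 1                # add at most one extra pod at a time
```

With `maxUnavailable: 0` and `maxSurge: 1`, each new pod must become Ready before an old one is terminated, so a failing image never reduces serving capacity.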
Production Canary (future, out of scope):
- Canary deployment is appropriate for production with real user traffic
- Will be addressed in separate ADR when production deployment is scoped
### 5. Pre-Deployment Validation
```yaml
pre_deployment_checks:
  # Existing checks
  - name: supabase_connectivity
    command: "curl -f ${SUPABASE_URL}/health"
    timeout: 10s
  - name: external_secrets_sync
    command: 'kubectl get externalsecret -n habit-hub-uat -o json | jq ''.items[].status.conditions[] | select(.type=="Ready") | .status'' | grep -q True'
    timeout: 30s
  - name: redis_connectivity
    command: "redis-cli -h ${REDIS_HOST} ping"
    timeout: 5s
  - name: certificate_validity
    command: "openssl s_client -connect uat.habitclan.com:443 2>/dev/null | openssl x509 -noout -checkend 604800"
    timeout: 10s
  # Added per council feedback
  - name: database_migration_status
    command: "npx supabase db diff --linked | grep -q 'No schema changes'"
    timeout: 30s
    on_failure: warn  # warn if pending migrations exist
  - name: cluster_capacity
    command: "kubectl top nodes | awk 'NR>1 {if ($3+0 > 80 || $5+0 > 80) exit 1}'"
    timeout: 15s
    on_failure: warn  # warn if any node exceeds 80% CPU or memory ($3+0 strips the % sign for a numeric compare)
  - name: pending_pod_restarts
    command: "kubectl get pods -n habit-hub-uat -o json | jq '[.items[].status.containerStatuses[]?.restartCount // 0] | add' | xargs test 5 -gt"
    timeout: 10s
    on_failure: warn  # warn once total container restarts reach 5
```
Note: Database migration and cluster capacity checks added per council recommendation to prevent deployment into unstable environments.
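A hypothetical runner for this check list, showing how the per-check `timeout` and the `on_failure: warn` semantics could be wired up; the function and its argument layout are assumptions for illustration, not the actual pipeline code (it relies on the GNU coreutils `timeout` command):

```shell
#!/usr/bin/env sh
# Run one pre-deployment check with a timeout and blocking/warn semantics.
# Usage: run_check <name> <timeout_seconds> <block|warn> <command...>
run_check() {
  name=$1; limit=$2; mode=$3; shift 3
  if timeout "$limit" sh -c "$*"; then
    echo "PASS $name"
    return 0
  fi
  if [ "$mode" = "warn" ]; then
    echo "WARN $name (non-blocking)" >&2
    return 0          # warnings are reported but do not block the deployment
  fi
  echo "FAIL $name (blocking)" >&2
  return 1
}
```

For example, `run_check supabase_connectivity 10 block "curl -f $SUPABASE_URL/health"` would block the deployment on failure, while the council-added checks would run with `warn`.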
### 6. Environment Drift Detection
Weekly scheduled job to detect configuration drift:
```yaml
drift_detection:
  schedule: "0 6 * * 1"  # Monday 06:00
  checks:
    - helm_values_vs_deployed
    - secret_rotation_status
    - resource_quota_utilization
    - node_version_consistency
  reporting:
    destination: linear_issue
    severity: warning
```
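As a sketch of how the scheduled job could be wired into GitHub Actions, the first check might look like this; the helm-diff plugin invocation and the release/chart paths are assumptions, not necessarily what the shipped drift-detection.yml does:

```yaml
# Illustrative workflow; the real drift-detection.yml may differ.
name: drift-detection
on:
  schedule:
    - cron: "0 6 * * 1"     # Monday 06:00 UTC
  workflow_dispatch: {}      # allow manual runs
jobs:
  helm-values-vs-deployed:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Diff chart values against the live UAT release
        run: |
          # Requires the helm-diff plugin; with --detailed-exitcode a
          # non-zero exit signals drift between git and the cluster.
          helm diff upgrade habit-hub helm/habit-hub \
            --values helm/habit-hub/values-uat.yaml \
            --namespace habit-hub-uat \
            --detailed-exitcode
```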
## Implementation Plan

### Phase 1: Quality Gates (Week 1-2)
- Add security scanning to UAT pipeline
- Implement pre-deployment validation checks (including new db migration, cluster capacity checks)
- Add post-deployment regression tests
- Configure parallel execution for G2-G4
### Phase 2: Rollback & Validation (Week 3-4)
- Implement automated rollback on failure with tightened thresholds
- Configure rolling deployment strategy (canary deferred to production)
- Add deployment metrics collection
### Phase 3: Operational Excellence (Week 5-6)
- Implement drift detection
- Add deployment dashboards
- Configure alerting thresholds
## Consequences

### Positive
- Faster detection of deployment issues
- Reduced manual intervention for rollbacks
- Better confidence in deployment quality
- Clearer audit trail for compliance
### Negative
- Increased pipeline complexity
- Pre-deployment checks add ~2-3 min latency
- More infrastructure to maintain
### Risks
- Tightened thresholds (2% error rate, 800ms P95) may cause false positives
- Pre-deployment checks add latency
- Rollback automation requires robust monitoring
- Rolling deployment less safe than canary for production (addressed in future ADR)
## Metrics
| Metric | Target | Current |
|---|---|---|
| Deployment Frequency | Daily | ~2/week |
| Lead Time for Changes | < 1 hour | ~2 hours |
| Change Failure Rate | < 5% | Unknown |
| Mean Time to Recovery | < 15 min | ~30 min |
| Pipeline Duration (UAT) | < 20 min | ~15 min |
| Rollback Success Rate | > 99% | N/A |
## LLM Council Review

Review Date: 2026-01-18. Consensus: 0.82 (Strong Agreement).

### Key Changes Based on Council Feedback
1. Gate Parallelization: G2, G3, G4 now run in parallel
   - Reduces pipeline time by ~40%
   - All must pass before proceeding to deployment
2. Rollback Thresholds Tightened:
   - Error rate: 5% → 2%
   - P95 latency: 2000ms → 800ms
   - Detection window shortened: 5 min → 3 min
   - Rationale: UAT has controlled traffic, can be more aggressive
3. Canary Deployment Removed for UAT:
   - Insufficient traffic for meaningful statistical analysis
   - False positives from low traffic would cause unnecessary rollbacks
   - Replaced with rolling deployment + post-deployment validation
   - Canary reserved for production (separate ADR)
4. Manual Approval Gate Removed:
   - G7 removed for UAT to enable continuous deployment
   - Manual approval remains for production only
5. New Pre-Deployment Checks Added:
   - Database migration status verification
   - Cluster capacity monitoring
   - Pod restart anomaly detection
### Council Dissent (Minority Opinion)
- One model advocated keeping canary with longer analysis windows
- Majority view: rolling deployment sufficient for UAT given traffic levels