ADR-011: DevOps Automation Pipeline

Status

Implemented. Reviewed by LLM Council (2026-01-18); implementation completed 2026-01-19.

Implementation Status

| Component | Status | Implementation |
|---|---|---|
| G1 Commit Gate | ✅ Implemented | .github/workflows/ci-build-test.yml |
| G2-G4 Parallelization | ✅ Implemented | .github/workflows/ci-build-test.yml |
| G5 Pre-Deploy Validation | ✅ Implemented | .github/workflows/ci-cd-uat.yml |
| G6 Post-Deploy Verification | ✅ Implemented | .github/workflows/ci-cd-uat.yml |
| Automated Rollback | ✅ Implemented | .github/workflows/ci-cd-uat.yml |
| Drift Detection | ✅ Implemented | .github/workflows/drift-detection.yml |
| Pre-commit Hooks | ✅ Implemented | .pre-commit-config.yaml |

Context

Habit Hub requires a robust DevOps automation pipeline that supports continuous integration, continuous deployment, and operational excellence across multiple environments (development, UAT, production). The pipeline must balance speed of delivery with quality gates and security requirements.

Current State Assessment

What Exists

| Component | Implementation | Location | Status |
|---|---|---|---|
| CI Build/Test | GitHub Actions | .github/workflows/ci-build-test.yml | ✅ Implemented |
| UAT Deployment | GitHub Actions + ArgoCD | .github/workflows/ci-cd-uat.yml | ✅ Implemented |
| Production Deployment | GitHub Actions + ArgoCD | .github/workflows/ci-cd-production.yml | ✅ Implemented |
| E2E Testing | Multi-platform Playwright | .github/workflows/e2e-multiplatform.yml | ✅ Implemented |
| Security Audit | Scheduled workflow | .github/workflows/security-audit.yml | ⚠️ Partial |
| Infrastructure as Code | Terraform | terraform/ | ⚠️ Partial (VPC complete, EKS partial) |
| GitOps | ArgoCD | k8s/argocd-*.yaml | ✅ Implemented |
| Helm Charts | Multi-env values | helm/habit-hub/ | ✅ Implemented |
| Secrets Management | External Secrets + AWS SM | k8s/external-secrets/ | ✅ Implemented |

Current CI/CD Flow

```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Commit    │───▶│    Test     │───▶│    Build    │───▶│   Deploy    │
│  (develop)  │    │  (pytest,   │    │  (Docker)   │    │  (ArgoCD)   │
└─────────────┘    │   vitest)   │    └─────────────┘    └─────────────┘
                   └─────────────┘           │                  │
                                             ▼                  ▼
                                      ┌─────────────┐    ┌─────────────┐
                                      │    GHCR     │    │     UAT     │
                                      │  (images)   │    │  Namespace  │
                                      └─────────────┘    └─────────────┘
```

GitHub Actions Workflows Inventory

| Workflow | Trigger | Purpose | Duration |
|---|---|---|---|
| ci-build-test.yml | PR, push to main/develop | Lint, test, type-check | ~8 min |
| ci-cd-uat.yml | Push to develop | Build, deploy to UAT | ~12 min |
| ci-cd-production.yml | Push to main | Build, scan, deploy to prod | ~18 min |
| e2e-multiplatform.yml | Manual, PR | Cross-browser E2E tests | ~25 min |
| security-audit.yml | Weekly schedule | SAST, SCA, secrets scan | ~15 min |

Gaps Identified

  1. No Automated Rollback: Failed deployments require manual intervention
  2. No Canary/Blue-Green: All-or-nothing deployments to UAT
  3. Security Scanning Not in UAT Pipeline: Only in production and scheduled
  4. No Pre-Deployment Validation: Missing environment health checks
  5. No Post-Deployment Verification: Only HTTP smoke tests
  6. Limited Observability: No deployment metrics/tracing
  7. No Environment Drift Detection: Configuration can diverge
  8. Manual Secrets Rotation: No automated rotation verification
  9. Incomplete IaC: EKS module not fully implemented

Decision

1. Enhanced Pipeline Architecture

```
┌───────────────────────────────────────────────────────────────────────┐
│                            PIPELINE STAGES                            │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐  │
│  │ Commit  │──▶│ Build   │──▶│ Test    │──▶│ Security│──▶│ Publish │  │
│  │ Gate    │   │ Stage   │   │ Stage   │   │ Stage   │   │ Stage   │  │
│  └─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘  │
│       │             │             │             │             │       │
│       ▼             ▼             ▼             ▼             ▼       │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐  │
│  │ Lint    │   │ Docker  │   │ Unit    │   │ SAST    │   │ GHCR    │  │
│  │ Commit  │   │ Build   │   │ Integr  │   │ SCA     │   │ Push    │  │
│  │ Msg     │   │         │   │ E2E     │   │ Secrets │   │         │  │
│  └─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘  │
│                                                                       │
├───────────────────────────────────────────────────────────────────────┤
│                           DEPLOYMENT STAGES                           │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐  │
│  │ Pre-    │──▶│ Deploy  │──▶│ Verify  │──▶│ Promote │──▶│ Monitor │  │
│  │ Check   │   │ (UAT)   │   │ Stage   │   │ Gate    │   │ Stage   │  │
│  └─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘  │
│       │             │             │             │             │       │
│       ▼             ▼             ▼             ▼             ▼       │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐  │
│  │ Health  │   │ ArgoCD  │   │ Smoke   │   │ Manual  │   │ Alerts  │  │
│  │ Secrets │   │ Sync    │   │ Regress │   │ Approval│   │ Metrics │  │
│  │ Deps    │   │ Canary  │   │ Perf    │   │         │   │         │  │
│  └─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘  │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘
```

2. Quality Gates

| Gate | Stage | Criteria | Blocking | Parallel |
|---|---|---|---|---|
| G1 | Commit | Lint pass, commit message format | Yes | - |
| G2 | Build | Docker build success, no critical vulns | Yes | ✅ G2-G4 parallel |
| G3 | Test | Unit 80%, Integration pass, E2E critical pass | Yes | ✅ G2-G4 parallel |
| G4 | Security | No HIGH/CRITICAL SAST, SCA clean | Yes | ✅ G2-G4 parallel |
| G5 | Deploy | Pre-check pass, ArgoCD sync healthy | Yes | - |
| G6 | Verify | Smoke pass, regression > 95%, P95 < 500ms | Yes (UAT), Alert (Prod) | - |

G6 Telemetry Configuration (Issue #38):

The G6 post-deployment verification gate uses the following telemetry parameters:

| Parameter | Value | Rationale |
|---|---|---|
| Time Window | 3 minutes | Balance between fast feedback and stability |
| Fail-Closed | Yes | If telemetry unavailable, fail the gate |
| Data Source | Direct curl (fallback) | Prometheus/Grafana primary, curl backup |
| Retry Count | 3 retries | Handle transient network issues |
| Retry Delay | 10 seconds | Allow services to stabilize |

```yaml
# .github/workflows/ci-cd-uat.yml - G6 telemetry configuration
env:
  TELEMETRY_TIME_WINDOW_MINUTES: 3
  TELEMETRY_FAIL_CLOSED: true
  TELEMETRY_RETRY_COUNT: 3
  TELEMETRY_RETRY_DELAY_SECONDS: 10
```

Fail-Closed Behavior: When telemetry systems are unavailable, the gate fails rather than passing blindly. This prevents deploying changes that can't be verified.
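The fail-closed retry loop can be sketched in shell. This is a minimal illustration, not the actual workflow step: `fetch_telemetry`, `PROMETHEUS_URL`, and the `query=up` probe are hypothetical stand-ins for the real Prometheus/Grafana query.

```shell
# Hypothetical sketch of the G6 fail-closed check; env var names mirror the
# workflow configuration, the fetch command is illustrative.
TELEMETRY_RETRY_COUNT=3
TELEMETRY_RETRY_DELAY_SECONDS=10

# Stand-in for the primary data source (Prometheus), with curl as transport.
fetch_telemetry() {
  curl -fsS "${PROMETHEUS_URL}/api/v1/query?query=up"
}

verify_telemetry() {
  local attempt
  for attempt in $(seq 1 "${TELEMETRY_RETRY_COUNT}"); do
    if fetch_telemetry >/dev/null 2>&1; then
      return 0    # telemetry reachable: the gate can evaluate metrics
    fi
    sleep "${TELEMETRY_RETRY_DELAY_SECONDS}"
  done
  # Fail closed: if telemetry never answers, the gate fails the deployment.
  echo "G6: telemetry unavailable after ${TELEMETRY_RETRY_COUNT} attempts" >&2
  return 1
}
```

The key property is the final `return 1`: an unreachable telemetry backend is treated the same as a failing metric, never as a pass.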

Note: G7 (manual approval) removed for UAT per council feedback - UAT should support continuous deployment. Manual approval remains for production only and is handled outside this ADR scope.

Gate Parallelization (council recommendation):

  • G2, G3, G4 run in parallel after G1 passes
  • Reduces pipeline time by ~40% (from ~25 min to ~15 min)
  • All three must pass before proceeding to G5
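In GitHub Actions terms, the fan-out/fan-in shape above is expressed with `needs:`. The job names and steps below are illustrative, not an excerpt of the real ci-build-test.yml:

```yaml
# Hypothetical job graph: G2-G4 fan out after G1, G5 fans back in.
jobs:
  g1-commit-gate:
    runs-on: ubuntu-latest
    steps:
      - run: echo "lint + commit message checks"
  g2-build:
    needs: g1-commit-gate          # starts as soon as G1 passes
    runs-on: ubuntu-latest
    steps:
      - run: echo "docker build"
  g3-test:
    needs: g1-commit-gate          # runs in parallel with g2-build
    runs-on: ubuntu-latest
    steps:
      - run: echo "unit + integration + e2e"
  g4-security:
    needs: g1-commit-gate          # runs in parallel with g2/g3
    runs-on: ubuntu-latest
    steps:
      - run: echo "SAST + SCA + secrets scan"
  g5-deploy:
    needs: [g2-build, g3-test, g4-security]   # all three must pass
    runs-on: ubuntu-latest
    steps:
      - run: echo "pre-checks + ArgoCD sync"
```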

3. Automated Rollback Strategy

```yaml
# Rollback triggers (thresholds tightened per council feedback)
triggers:
  - condition: smoke_test_failure
    action: immediate_rollback
    notification: slack_critical

  - condition: error_rate > 2%      # Tightened from 5% per council
    window: 3_minutes               # Reduced window for faster detection
    action: automatic_rollback
    notification: slack_alert

  - condition: p95_latency > 800ms  # Tightened from 2000ms per council
    window: 5_minutes               # Reduced window
    action: pause_deployment
    notification: slack_warning

  - condition: database_connection_failures > 0
    window: 1_minute
    action: immediate_rollback
    notification: slack_critical
```

Threshold Rationale (council feedback):

  • 2% error rate is aggressive but appropriate for UAT where traffic is controlled
  • 800ms P95 aligns with user experience expectations for habit tracking app
  • Database failures are always critical - immediate rollback
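The precedence among these triggers can be sketched as a small decision function. The function and its argument shape are hypothetical; in the pipeline these inputs would come from the telemetry queries described above:

```shell
# Hypothetical mapping from observed metrics to rollback actions.
decide_action() {
  local error_rate_pct="$1"   # error rate over the 3-minute window, in %
  local p95_ms="$2"           # P95 latency over the 5-minute window, in ms
  local db_failures="$3"      # DB connection failures in the last minute

  if [ "$db_failures" -gt 0 ]; then
    echo immediate_rollback       # database failures are always critical
  elif awk -v e="$error_rate_pct" 'BEGIN { exit !(e > 2) }'; then
    echo automatic_rollback       # error rate above 2%
  elif [ "$p95_ms" -gt 800 ]; then
    echo pause_deployment         # P95 above 800ms
  else
    echo proceed
  fi
}
```

Note the ordering: database failures dominate, then the error-rate rollback, and only then the softer pause-on-latency action.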

4. Deployment Strategy for UAT

Council Decision: Canary deployment removed for UAT environment.

Rationale:

  • UAT has insufficient traffic volume for meaningful canary analysis
  • Canary weights (10%, 50%, 100%) require statistically significant traffic to detect anomalies
  • False positives from low traffic would cause unnecessary rollbacks
  • Adds ~15 min to deployment time without proportional benefit

Recommended Approach for UAT:

```yaml
# Rolling deployment (not canary)
deployment:
  strategy: RollingUpdate
  maxUnavailable: 0
  maxSurge: 1
  progressDeadlineSeconds: 300

# Post-deployment validation (replaces canary analysis)
validation:
  smoke_tests: true
  health_check_retries: 3
  health_check_interval: 10s
```

Production Canary (future, out of scope):

  • Canary deployment is appropriate for production with real user traffic
  • Will be addressed in separate ADR when production deployment is scoped

5. Pre-Deployment Validation

```yaml
pre_deployment_checks:
  # Existing checks
  - name: supabase_connectivity
    command: "curl -f ${SUPABASE_URL}/health"
    timeout: 10s

  - name: external_secrets_sync
    command: 'kubectl get externalsecret -n habit-hub-uat -o json | jq ''.items[].status.conditions[] | select(.type=="Ready") | .status'' | grep -q True'
    timeout: 30s

  - name: redis_connectivity
    command: "redis-cli -h ${REDIS_HOST} ping"
    timeout: 5s

  - name: certificate_validity
    command: "openssl s_client -connect uat.habitclan.com:443 2>/dev/null | openssl x509 -noout -checkend 604800"
    timeout: 10s

  # Added per council feedback
  - name: database_migration_status
    command: "npx supabase db diff --linked | grep -q 'No schema changes'"
    timeout: 30s
    on_failure: warn  # Warn if pending migrations exist

  - name: cluster_capacity
    # $3+0 / $5+0 force numeric comparison of the "NN%" columns
    command: "kubectl top nodes | awk 'NR>1 {if ($3+0 > 80 || $5+0 > 80) exit 1}'"
    timeout: 15s
    on_failure: warn  # Warn if any node exceeds 80% CPU or memory

  - name: pending_pod_restarts
    command: "kubectl get pods -n habit-hub-uat -o json | jq '[.items[].status.containerStatuses[]?.restartCount // 0] | add' | xargs test 5 -gt"
    timeout: 10s
    on_failure: warn  # Warn if total restart count reaches 5
```

Note: Database migration and cluster capacity checks added per council recommendation to prevent deployment into unstable environments.
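The `on_failure` semantics can be sketched with a tiny runner; this helper is illustrative, not how the workflow actually dispatches checks:

```shell
# Hypothetical check runner: a failing check blocks the deployment unless
# it is marked on_failure=warn, in which case it only logs.
run_check() {
  local name="$1" on_failure="$2"
  shift 2
  if "$@"; then
    echo "ok: $name"
  elif [ "$on_failure" = "warn" ]; then
    echo "warn: $name failed (non-blocking)"
  else
    echo "fail: $name (blocking)" >&2
    return 1
  fi
}
```

A connectivity check such as `supabase_connectivity` would run with `block`, while the capacity and restart checks run with `warn`, matching the YAML above.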

6. Environment Drift Detection

Weekly scheduled job to detect configuration drift:

```yaml
drift_detection:
  schedule: "0 6 * * 1"  # Monday 6 AM
  checks:
    - helm_values_vs_deployed
    - secret_rotation_status
    - resource_quota_utilization
    - node_version_consistency
  reporting:
    destination: linear_issue
    severity: warning
```
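At its core, the `helm_values_vs_deployed` check is a file comparison. A minimal sketch, assuming the second input is produced by `helm get values <release> -n <namespace> -o yaml` (the helper name is hypothetical):

```shell
# Hypothetical drift check: compare values committed in git against the
# values the cluster reports for the deployed release.
detect_drift() {
  # $1 = repo values file, $2 = deployed values dump
  if diff -q "$1" "$2" >/dev/null 2>&1; then
    echo "in sync"
  else
    echo "drift detected"
  fi
}
```

In the scheduled job, a "drift detected" result would be turned into a Linear issue with warning severity, per the reporting block above.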

Implementation Plan

Phase 1: Quality Gates (Week 1-2)

  1. Add security scanning to UAT pipeline
  2. Implement pre-deployment validation checks (including new db migration, cluster capacity checks)
  3. Add post-deployment regression tests
  4. Configure parallel execution for G2-G4

Phase 2: Rollback & Validation (Week 3-4)

  1. Implement automated rollback on failure with tightened thresholds
  2. Configure rolling deployment strategy (canary deferred to production)
  3. Add deployment metrics collection

Phase 3: Operational Excellence (Week 5-6)

  1. Implement drift detection
  2. Add deployment dashboards
  3. Configure alerting thresholds

Consequences

Positive

  • Faster detection of deployment issues
  • Reduced manual intervention for rollbacks
  • Better confidence in deployment quality
  • Clearer audit trail for compliance

Negative

  • Increased pipeline complexity
  • Pre-deployment checks add ~2-3 min latency
  • More infrastructure to maintain

Risks

  • Tightened thresholds (2% error rate, 800ms P95) may cause false positives
  • Pre-deployment checks add latency
  • Rollback automation requires robust monitoring
  • Rolling deployment less safe than canary for production (addressed in future ADR)

Metrics

| Metric | Target | Current |
|---|---|---|
| Deployment Frequency | Daily | ~2/week |
| Lead Time for Changes | < 1 hour | ~2 hours |
| Change Failure Rate | < 5% | Unknown |
| Mean Time to Recovery | < 15 min | ~30 min |
| Pipeline Duration (UAT) | < 20 min | ~15 min |
| Rollback Success Rate | > 99% | N/A |

LLM Council Review

Review Date: 2026-01-18
Consensus: 0.82 (Strong Agreement)

Key Changes Based on Council Feedback

  1. Gate Parallelization: G2, G3, G4 now run in parallel

    • Reduces pipeline time by ~40%
    • All must pass before proceeding to deployment
  2. Rollback Thresholds Tightened:

    • Error rate: 5% → 2%
    • P95 latency: 2000ms → 800ms
    • Detection window shortened: 5 min → 3 min
    • Rationale: UAT has controlled traffic, can be more aggressive
  3. Canary Deployment Removed for UAT:

    • Insufficient traffic for meaningful statistical analysis
    • False positives from low traffic would cause unnecessary rollbacks
    • Replaced with rolling deployment + post-deployment validation
    • Canary reserved for production (separate ADR)
  4. Manual Approval Gate Removed:

    • G7 removed for UAT to enable continuous deployment
    • Manual approval remains for production only
  5. New Pre-Deployment Checks Added:

    • Database migration status verification
    • Cluster capacity monitoring
    • Pod restart anomaly detection

Council Dissent (Minority Opinion)

  • One model advocated keeping canary with longer analysis windows
  • Majority view: rolling deployment sufficient for UAT given traffic levels
