# ADR-011: DevOps Automation Pipeline

## Status

Implemented - Reviewed by LLM Council (2026-01-18); implementation completed (2026-01-19).

## Implementation Status
| Component | Status | Implementation |
|---|---|---|
| G1 Commit Gate | ✅ Implemented | .github/workflows/ci-build-test.yml |
| G2-G4 Parallelization | ✅ Implemented | .github/workflows/ci-build-test.yml |
| G5 Pre-Deploy Validation | ✅ Implemented | .github/workflows/ci-cd-uat.yml |
| G6 Post-Deploy Verification | ✅ Implemented | .github/workflows/ci-cd-uat.yml |
| Automated Rollback | ✅ Implemented | .github/workflows/ci-cd-uat.yml |
| Drift Detection | ✅ Implemented | .github/workflows/drift-detection.yml |
| Pre-commit Hooks | ✅ Implemented | .pre-commit-config.yaml |
## Context
Habit Hub requires a robust DevOps automation pipeline that supports continuous integration, continuous deployment, and operational excellence across multiple environments (development, UAT, production). The pipeline must balance speed of delivery with quality gates and security requirements.
### Current State Assessment

#### What Exists
| Component | Implementation | Location | Status |
|---|---|---|---|
| CI Build/Test | GitHub Actions | .github/workflows/ci-build-test.yml | ✅ Implemented |
| UAT Deployment | GitHub Actions + ArgoCD | .github/workflows/ci-cd-uat.yml | ✅ Implemented |
| Production Deployment | GitHub Actions + ArgoCD | .github/workflows/ci-cd-production.yml | ✅ Implemented |
| E2E Testing | Multi-platform Playwright | .github/workflows/e2e-multiplatform.yml | ✅ Implemented |
| Security Audit | Scheduled workflow | .github/workflows/security-audit.yml | ⚠️ Partial |
| Infrastructure as Code | Terraform | terraform/ | ⚠️ Partial (VPC complete, EKS partial) |
| GitOps | ArgoCD | k8s/argocd-*.yaml | ✅ Implemented |
| Helm Charts | Multi-env values | helm/habit-hub/ | ✅ Implemented |
| Secrets Management | External Secrets + AWS SM | k8s/external-secrets/ | ✅ Implemented |
### Current CI/CD Flow
```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Commit    │───▶│    Test     │───▶│    Build    │───▶│   Deploy    │
│  (develop)  │    │  (pytest,   │    │  (Docker)   │    │  (ArgoCD)   │
└─────────────┘    │   vitest)   │    └─────────────┘    └─────────────┘
                   └─────────────┘           │                  │
                                             ▼                  ▼
                                      ┌─────────────┐    ┌─────────────┐
                                      │    GHCR     │    │     UAT     │
                                      │  (images)   │    │  Namespace  │
                                      └─────────────┘    └─────────────┘
```
### GitHub Actions Workflows Inventory
| Workflow | Trigger | Purpose | Duration |
|---|---|---|---|
| ci-build-test.yml | PR, push to main/develop | Lint, test, type-check | ~8 min |
| ci-cd-uat.yml | Push to develop | Build, deploy to UAT | ~12 min |
| ci-cd-production.yml | Push to main | Build, scan, deploy to prod | ~18 min |
| e2e-multiplatform.yml | Manual, PR | Cross-browser E2E tests | ~25 min |
| security-audit.yml | Weekly schedule | SAST, SCA, secrets scan | ~15 min |
### Gaps Identified
- No Automated Rollback: Failed deployments require manual intervention
- No Canary/Blue-Green: All-or-nothing deployments to UAT
- Security Scanning Not in UAT Pipeline: Only in production and scheduled
- No Pre-Deployment Validation: Missing environment health checks
- No Post-Deployment Verification: Only HTTP smoke tests
- Limited Observability: No deployment metrics/tracing
- No Environment Drift Detection: Configuration can diverge
- Manual Secrets Rotation: No automated rotation verification
- Incomplete IaC: EKS module not fully implemented
## Decision

### 1. Enhanced Pipeline Architecture
```
┌────────────────────────────────────────────────────────────────────────────┐
│                              PIPELINE STAGES                               │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                            │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐      │
│   │ Commit  │──▶│  Build  │──▶│  Test   │──▶│ Security│──▶│ Publish │      │
│   │  Gate   │   │  Stage  │   │  Stage  │   │  Stage  │   │  Stage  │      │
│   └─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘      │
│        │             │             │             │             │           │
│        ▼             ▼             ▼             ▼             ▼           │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐      │
│   │  Lint   │   │ Docker  │   │  Unit   │   │  SAST   │   │  GHCR   │      │
│   │ Commit  │   │  Build  │   │ Integr  │   │   SCA   │   │  Push   │      │
│   │   Msg   │   │         │   │   E2E   │   │ Secrets │   │         │      │
│   └─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘      │
│                                                                            │
├────────────────────────────────────────────────────────────────────────────┤
│                             DEPLOYMENT STAGES                              │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                            │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐      │
│   │  Pre-   │──▶│ Deploy  │──▶│ Verify  │──▶│ Promote │──▶│ Monitor │      │
│   │  Check  │   │  (UAT)  │   │  Stage  │   │  Gate   │   │  Stage  │      │
│   └─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘      │
│        │             │             │             │             │           │
│        ▼             ▼             ▼             ▼             ▼           │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐      │
│   │ Health  │   │ ArgoCD  │   │  Smoke  │   │ Manual  │   │ Alerts  │      │
│   │ Secrets │   │  Sync   │   │ Regress │   │ Approval│   │ Metrics │      │
│   │  Deps   │   │ Rolling │   │  Perf   │   │         │   │         │      │
│   └─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘      │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘
```
### 2. Quality Gates
| Gate | Stage | Criteria | Blocking | Parallel |
|---|---|---|---|---|
| G1 | Commit | Lint pass, commit message format | Yes | - |
| G2 | Build | Docker build success, no critical vulns | Yes | ✅ G2-G4 parallel |
| G3 | Test | Unit 80%, Integration pass, E2E critical pass | Yes | ✅ G2-G4 parallel |
| G4 | Security | No HIGH/CRITICAL SAST, SCA clean | Yes | ✅ G2-G4 parallel |
| G5 | Deploy | Pre-check pass, ArgoCD sync healthy | Yes | - |
| G6 | Verify | Smoke pass, regression > 95%, P95 < 500ms | Yes (UAT), Alert (Prod) | - |
G6 Telemetry Configuration (Issue #38):
The G6 post-deployment verification gate uses the following telemetry parameters:
| Parameter | Value | Rationale |
|---|---|---|
| Time Window | 3 minutes | Balance between fast feedback and stability |
| Fail-Closed | Yes | If telemetry unavailable, fail the gate |
| Data Source | Direct curl (fallback) | Prometheus/Grafana primary, curl backup |
| Retry Count | 3 retries | Handle transient network issues |
| Retry Delay | 10 seconds | Allow services to stabilize |
```yaml
# .github/workflows/ci-cd-uat.yml - G6 telemetry configuration
env:
  TELEMETRY_TIME_WINDOW_MINUTES: 3
  TELEMETRY_FAIL_CLOSED: true
  TELEMETRY_RETRY_COUNT: 3
  TELEMETRY_RETRY_DELAY_SECONDS: 10
```
Fail-Closed Behavior: When telemetry systems are unavailable, the gate fails rather than passing blindly. This prevents deploying changes that can't be verified.
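The fail-closed retry loop can be sketched in shell. This is a minimal illustration, not the shipped workflow step; `query_error_rate` is a hypothetical stand-in for the real Prometheus query or curl fallback:

```shell
#!/usr/bin/env sh
# Fail-closed telemetry check: retry a bounded number of times; if the
# telemetry source never answers, fail the gate rather than pass blindly.
# query_error_rate is a hypothetical stand-in for the Prometheus/curl call.
TELEMETRY_RETRY_COUNT=3
TELEMETRY_RETRY_DELAY_SECONDS=10

check_telemetry() {
  attempt=1
  while [ "$attempt" -le "$TELEMETRY_RETRY_COUNT" ]; do
    if rate=$(query_error_rate); then
      echo "error rate: $rate"
      return 0
    fi
    echo "telemetry unavailable (attempt $attempt), retrying..." >&2
    sleep "$TELEMETRY_RETRY_DELAY_SECONDS"
    attempt=$((attempt + 1))
  done
  echo "telemetry unreachable after $TELEMETRY_RETRY_COUNT attempts: failing closed" >&2
  return 1
}
```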
Note: G7 (manual approval) was removed for UAT per council feedback, since UAT should support continuous deployment. Manual approval remains for production only and is handled outside this ADR's scope.
Gate Parallelization (council recommendation):
- G2, G3, G4 run in parallel after G1 passes
- Reduces pipeline time by ~40% (from ~25 min to ~15 min)
- All three must pass before proceeding to G5
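In GitHub Actions terms, the parallel gates map onto job-level `needs:` dependencies. The following is an illustrative sketch only; the job names and steps are assumptions and may not match those in ci-build-test.yml:

```yaml
# Illustrative job graph for G1-G5; actual job names/steps may differ.
jobs:
  commit-gate:            # G1: lint + commit message format
    runs-on: ubuntu-latest
    steps:
      - run: echo "lint + commitlint"
  build:                  # G2: runs in parallel with test and security
    needs: commit-gate
    runs-on: ubuntu-latest
    steps:
      - run: echo "docker build"
  test:                   # G3
    needs: commit-gate
    runs-on: ubuntu-latest
    steps:
      - run: echo "unit + integration + critical e2e"
  security:               # G4
    needs: commit-gate
    runs-on: ubuntu-latest
    steps:
      - run: echo "sast + sca + secrets scan"
  deploy-uat:             # G5: gated on all three parallel jobs
    needs: [build, test, security]
    runs-on: ubuntu-latest
    steps:
      - run: echo "pre-checks + argocd sync"
```

Because `build`, `test`, and `security` each depend only on `commit-gate`, the runner schedules them concurrently, and `deploy-uat` starts only when all three succeed.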
### 3. Automated Rollback Strategy
```yaml
# Rollback triggers (thresholds tightened per council feedback)
triggers:
  - condition: smoke_test_failure
    action: immediate_rollback
    notification: slack_critical
  - condition: error_rate > 2%      # tightened from 5% per council
    window: 3_minutes               # shorter window for faster detection
    action: automatic_rollback
    notification: slack_alert
  - condition: p95_latency > 800ms  # tightened from 2000ms per council
    window: 5_minutes               # reduced window
    action: pause_deployment
    notification: slack_warning
  - condition: database_connection_failures > 0
    window: 1_minute
    action: immediate_rollback
    notification: slack_critical
```
Threshold Rationale (council feedback):
- A 2% error rate is aggressive but appropriate for UAT, where traffic is controlled
- An 800ms P95 aligns with user-experience expectations for a habit-tracking app
- Database connection failures are always critical and trigger an immediate rollback
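The trigger evaluation reduces to a small decision function. A sketch in shell, with metric values passed as arguments; in the real pipeline these would come from the telemetry source, and the function name is illustrative:

```shell
#!/usr/bin/env sh
# Decide the rollback action from observed metrics (illustrative helper).
# Usage: decide_action <error_rate_pct> <p95_ms> <db_conn_failures>
# Thresholds mirror the trigger config: 2% errors, 800ms P95, any DB failure.
decide_action() {
  error_rate=$1; p95_ms=$2; db_failures=$3
  if [ "$db_failures" -gt 0 ]; then
    echo "immediate_rollback"            # DB failures are always critical
  elif awk "BEGIN { exit !($error_rate > 2) }"; then
    echo "automatic_rollback"            # error rate above 2%
  elif [ "$p95_ms" -gt 800 ]; then
    echo "pause_deployment"              # latency degraded but not fatal
  else
    echo "proceed"
  fi
}
```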
### 4. Deployment Strategy for UAT

Council Decision: canary deployment is removed for the UAT environment.
Rationale:
- UAT has insufficient traffic volume for meaningful canary analysis
- Canary weights (10%, 50%, 100%) require statistically significant traffic to detect anomalies
- False positives from low traffic would cause unnecessary rollbacks
- Adds ~15 min to deployment time without proportional benefit
Recommended Approach for UAT:
```yaml
# Rolling deployment (not canary)
deployment:
  strategy: RollingUpdate
  maxUnavailable: 0
  maxSurge: 1
  progressDeadlineSeconds: 300

# Post-deployment validation (replaces canary analysis)
validation:
  smoke_tests: true
  health_check_retries: 3
  health_check_interval: 10s
```
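On the Kubernetes side, those settings correspond to standard fields on the Deployment object. A sketch of the relevant excerpt; the resource name and replica count are assumptions about the Helm chart's output, not verified values:

```yaml
# Illustrative Deployment excerpt; name, namespace, and replicas are assumed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: habit-hub-api
  namespace: habit-hub-uat
spec:
  replicas: 2
  progressDeadlineSeconds: 300   # mark the rollout failed if stuck for 5 min
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0          # never drop below full capacity during rollout
      maxSurge: 1                # add at most one extra pod at a time
```

With `maxUnavailable: 0` and `maxSurge: 1`, each new pod must become Ready before an old one is terminated, so a failing image never reduces serving capacity.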
Production Canary (future, out of scope):
- Canary deployment is appropriate for production with real user traffic
- Will be addressed in separate ADR when production deployment is scoped
### 5. Pre-Deployment Validation
```yaml
pre_deployment_checks:
  # Existing checks
  - name: supabase_connectivity
    command: "curl -f ${SUPABASE_URL}/health"
    timeout: 10s
  - name: external_secrets_sync
    command: 'kubectl get externalsecret -n habit-hub-uat -o json | jq ''.items[].status.conditions[] | select(.type=="Ready") | .status'' | grep -q True'
    timeout: 30s
  - name: redis_connectivity
    command: "redis-cli -h ${REDIS_HOST} ping"
    timeout: 5s
  - name: certificate_validity
    command: "openssl s_client -connect uat.habitclan.com:443 2>/dev/null | openssl x509 -noout -checkend 604800"
    timeout: 10s
  # Added per council feedback
  - name: database_migration_status
    command: "npx supabase db diff --linked | grep -q 'No schema changes'"
    timeout: 30s
    on_failure: warn  # warn if pending migrations exist
  - name: cluster_capacity
    command: "kubectl top nodes | awk 'NR>1 {if ($3+0 > 80 || $5+0 > 80) exit 1}'"
    timeout: 15s
    on_failure: warn  # warn if any node exceeds 80% CPU or memory ($3+0 strips the % sign for a numeric compare)
  - name: pending_pod_restarts
    command: "kubectl get pods -n habit-hub-uat -o json | jq '[.items[].status.containerStatuses[]?.restartCount // 0] | add' | xargs test 5 -gt"
    timeout: 10s
    on_failure: warn  # warn once total container restarts reach 5
```
Note: Database migration and cluster capacity checks added per council recommendation to prevent deployment into unstable environments.
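A hypothetical runner for this check list, showing how the per-check `timeout` and the `on_failure: warn` semantics could be wired up; the function and its argument layout are assumptions for illustration, not the actual pipeline code (it relies on the GNU coreutils `timeout` command):

```shell
#!/usr/bin/env sh
# Run one pre-deployment check with a timeout and blocking/warn semantics.
# Usage: run_check <name> <timeout_seconds> <block|warn> <command...>
run_check() {
  name=$1; limit=$2; mode=$3; shift 3
  if timeout "$limit" sh -c "$*"; then
    echo "PASS $name"
    return 0
  fi
  if [ "$mode" = "warn" ]; then
    echo "WARN $name (non-blocking)" >&2
    return 0          # warnings are reported but do not block the deployment
  fi
  echo "FAIL $name (blocking)" >&2
  return 1
}
```

For example, `run_check supabase_connectivity 10 block "curl -f $SUPABASE_URL/health"` would block the deployment on failure, while the council-added checks would run with `warn`.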
### 6. Environment Drift Detection
Weekly scheduled job to detect configuration drift:
```yaml
drift_detection:
  schedule: "0 6 * * 1"  # Monday 06:00
  checks:
    - helm_values_vs_deployed
    - secret_rotation_status
    - resource_quota_utilization
    - node_version_consistency
  reporting:
    destination: linear_issue
    severity: warning
```
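As a sketch of how the scheduled job could be wired into GitHub Actions, the first check might look like this; the helm-diff plugin invocation and the release/chart paths are assumptions, not necessarily what the shipped drift-detection.yml does:

```yaml
# Illustrative workflow; the real drift-detection.yml may differ.
name: drift-detection
on:
  schedule:
    - cron: "0 6 * * 1"     # Monday 06:00 UTC
  workflow_dispatch: {}      # allow manual runs
jobs:
  helm-values-vs-deployed:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Diff chart values against the live UAT release
        run: |
          # Requires the helm-diff plugin; with --detailed-exitcode a
          # non-zero exit signals drift between git and the cluster.
          helm diff upgrade habit-hub helm/habit-hub \
            --values helm/habit-hub/values-uat.yaml \
            --namespace habit-hub-uat \
            --detailed-exitcode
```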
## Implementation Plan

### Phase 1: Quality Gates (Week 1-2)
- Add security scanning to UAT pipeline
- Implement pre-deployment validation checks (including new db migration, cluster capacity checks)
- Add post-deployment regression tests
- Configure parallel execution for G2-G4
### Phase 2: Rollback & Validation (Week 3-4)
- Implement automated rollback on failure with tightened thresholds
- Configure rolling deployment strategy (canary deferred to production)
- Add deployment metrics collection
### Phase 3: Operational Excellence (Week 5-6)
- Implement drift detection
- Add deployment dashboards
- Configure alerting thresholds
## Consequences

### Positive
- Faster detection of deployment issues
- Reduced manual intervention for rollbacks
- Better confidence in deployment quality
- Clearer audit trail for compliance
### Negative
- Increased pipeline complexity
- Pre-deployment checks add ~2-3 min latency
- More infrastructure to maintain
### Risks
- Tightened thresholds (2% error rate, 800ms P95) may cause false positives
- Pre-deployment checks add latency
- Rollback automation requires robust monitoring
- Rolling deployment less safe than canary for production (addressed in future ADR)
## Metrics
| Metric | Target | Current |
|---|---|---|
| Deployment Frequency | Daily | ~2/week |
| Lead Time for Changes | < 1 hour | ~2 hours |
| Change Failure Rate | < 5% | Unknown |
| Mean Time to Recovery | < 15 min | ~30 min |
| Pipeline Duration (UAT) | < 20 min | ~15 min |
| Rollback Success Rate | > 99% | N/A |
## LLM Council Review

Review Date: 2026-01-18. Consensus: 0.82 (Strong Agreement).

### Key Changes Based on Council Feedback
1. Gate Parallelization: G2, G3, G4 now run in parallel
   - Reduces pipeline time by ~40%
   - All must pass before proceeding to deployment
2. Rollback Thresholds Tightened:
   - Error rate: 5% → 2%
   - P95 latency: 2000ms → 800ms
   - Detection window shortened: 5 min → 3 min
   - Rationale: UAT has controlled traffic, can be more aggressive
3. Canary Deployment Removed for UAT:
   - Insufficient traffic for meaningful statistical analysis
   - False positives from low traffic would cause unnecessary rollbacks
   - Replaced with rolling deployment + post-deployment validation
   - Canary reserved for production (separate ADR)
4. Manual Approval Gate Removed:
   - G7 removed for UAT to enable continuous deployment
   - Manual approval remains for production only
5. New Pre-Deployment Checks Added:
   - Database migration status verification
   - Cluster capacity monitoring
   - Pod restart anomaly detection
### Council Dissent (Minority Opinion)
- One model advocated keeping canary with longer analysis windows
- Majority view: rolling deployment sufficient for UAT given traffic levels