
ADR-009: Ephemeral Environment Infrastructure

Status

Implemented (UAT environment deployed January 2026)

Implementation Summary

The UAT environment has been successfully deployed to uat.habitclan.com with the following components:

| Component | Implementation | Status |
|---|---|---|
| EKS Cluster | Kubernetes 1.34 in eu-west-1 | ✅ Running |
| Ingress | NGINX Ingress Controller via NLB | ✅ Deployed |
| TLS | cert-manager with Let's Encrypt | ✅ Configured |
| GitOps | ArgoCD with auto-sync | ✅ Active |
| Secrets | External Secrets Operator + AWS Secrets Manager | ✅ Syncing |
| Node Groups | On-demand (UAT) + Spot (PR previews) | ✅ Configured |

Decision Drivers

  • Cost ceiling: ~$400-500/month for non-production infrastructure (revised after LLM Council review)
  • Must mirror production deployment patterns
  • Must isolate auth and data between environments
  • Self-service for developers (PR-triggered)
  • Automated teardown to prevent cost leaks

Context

The Habit Hub project needs the ability to:

  1. Spin up isolated test environments on-demand for UAT testing
  2. Create A/B testing environments for feature comparison
  3. Provide preview environments for pull requests
  4. Tear down environments automatically after use

Current State

  • Terraform: VPC module complete, EKS/RDS modules empty
  • Kubernetes: Full base manifests, empty overlays
  • Helm: Chart with dev/production values
  • ArgoCD: GitOps configuration for staging/production
  • GitHub Actions: Build pipelines, incomplete deployment automation
  • Database: Supabase (PostgreSQL) with authentication

Decision

We will implement ephemeral environments using:

  1. ArgoCD ApplicationSets with Pull Request Generator (GitOps consistency)
  2. Kubernetes Namespaces for compute isolation
  3. Supabase Database Branching for database isolation
  4. Helm with per-environment values (single source of truth, no Kustomize)
  5. Wildcard DNS + cert-manager for automatic SSL
  6. OAuth2 Proxy for preview environment authentication
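
Decision 1 can be sketched as an ApplicationSet using ArgoCD's Pull Request Generator. The repository URL, chart path, and values file below are illustrative assumptions, not the project's actual configuration:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: habit-hub-pr-previews
  namespace: argocd
spec:
  generators:
    - pullRequest:
        github:
          owner: habit-hub             # assumed org name
          repo: habit-hub              # assumed repo name
          tokenRef:
            secretName: github-token
            key: token
        requeueAfterSeconds: 120       # poll GitHub every 2 minutes
  template:
    metadata:
      name: "habit-hub-pr-{{number}}"
    spec:
      project: default
      source:
        repoURL: https://github.com/habit-hub/habit-hub.git  # assumed
        targetRevision: "{{head_sha}}"
        path: helm/habit-hub
        helm:
          valueFiles:
            - values-preview.yaml      # assumed per-type values file
          parameters:
            - name: ingress.host
              value: "pr-{{number}}.preview.habithub.dev"
      destination:
        server: https://kubernetes.default.svc
        namespace: "habit-hub-pr-{{number}}"
      syncPolicy:
        automated:
          prune: true                  # removes the environment when the PR closes
        syncOptions:
          - CreateNamespace=true
```

When the PR closes, the generator drops the entry and ArgoCD prunes the generated Application, which is what makes teardown automatic.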

Environment Types

| Type | Trigger | Lifespan | Database | Access |
|---|---|---|---|---|
| PR Preview | PR opened | Until PR closed + TTL | Supabase branch | Org members only |
| UAT | develop branch | Persistent | Dedicated Supabase project | Org members |
| A/B Test | Manual | Until experiment ends | Read-only replica | Internal |
| Load Test | Manual | Hours | Ephemeral branch | Internal |

Architecture

┌─────────────────────────────────────┐
│         AWS Secrets Manager         │
│  ┌─────────────┐   ┌─────────────┐  │
│  │ app-secrets │   │ supa-secrets│  │
│  └──────┬──────┘   └──────┬──────┘  │
└─────────┼─────────────────┼─────────┘
          │                 │
          ▼                 ▼
┌─────────────────────────────────────────────────────────┐
│                 EKS Cluster (eu-west-1)                 │
├─────────────────────────────────────────────────────────┤
│   ┌─────────────────────────────────────────────────┐   │
│   │        External Secrets Operator (IRSA)         │   │
│   └─────────────────────────────────────────────────┘   │
│                                                         │
│   ┌─────────────────────────────────────────────────┐   │
│   │         NGINX Ingress Controller (NLB)          │   │
│   │    uat.habitclan.com, api-uat.habitclan.com     │   │
│   │      *.preview.habithub.dev → OAuth2 Proxy      │   │
│   └─────────────────────────────────────────────────┘   │
│          │                 │                 │          │
│   ┌──────▼──────┐   ┌──────▼──────┐   ┌──────▼──────┐   │
│   │ habit-hub-  │   │ habit-hub-  │   │ habit-hub-  │   │
│   │ production  │   │     uat     │   │   pr-123    │   │
│   │ (On-Demand) │   │ (On-Demand) │   │   (Spot)    │   │
│   └──────┬──────┘   └──────┬──────┘   └──────┬──────┘   │
│          │                 │                 │          │
│   ┌─────────────────────────────────────────────────┐   │
│   │          cert-manager (Let's Encrypt)           │   │
│   └─────────────────────────────────────────────────┘   │
└──────────┼─────────────────┼─────────────────┼──────────┘
           │                 │                 │
    ┌──────▼──────┐   ┌──────▼──────┐   ┌──────▼──────┐
    │  Supabase   │   │  Supabase   │   │  Supabase   │
    │ Production  │   │ UAT Project │   │  PR Branch  │
    └─────────────┘   └─────────────┘   └─────────────┘

Implementation

Phase 1: Terraform Modules (Week 1)

  • Complete EKS module with:
    • On-demand node group for production/UAT
    • Spot instance node group for ephemeral (60-70% cost savings)
    • Karpenter for dynamic scaling
  • IAM roles for service accounts (IRSA)
  • VPC with proper subnets

Phase 2: ArgoCD ApplicationSet (Week 1-2)

  • PR Generator for automatic preview environments
  • GitHub token secret for PR monitoring
  • Automatic cleanup on PR close

Phase 3: Security & Access Control (Week 2)

  • OAuth2 Proxy for preview authentication
  • NetworkPolicies for namespace isolation
  • Pod Security Standards (restricted)
  • ResourceQuotas and LimitRanges per namespace
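
The restricted Pod Security Standard is enforced with the standard namespace labels; a minimal sketch (the environment-type label is an assumed convention for the cleanup automation, not confirmed in this ADR):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: habit-hub-pr-123
  labels:
    # Pod Security Standards: reject, log, and warn on any non-restricted pod spec
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
    environment-type: preview   # assumed label consumed by TTL cleanup
```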

Phase 4: Supabase Integration (Week 2-3)

  • GitHub Action for branch creation (parallel with image build)
  • Migration automation on new branches
  • Data seeding strategy (minimal seed for PR, anonymized for UAT)
  • Branch cleanup webhook
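
The branch-creation job can run as its own workflow job so it overlaps with the image build. The Supabase Management API call and secret names below are assumptions sketching the shape, not verified project configuration:

```yaml
# .github/workflows/preview.yaml (sketch)
jobs:
  create-db-branch:
    runs-on: ubuntu-latest
    steps:
      - name: Create Supabase branch for this PR
        run: |
          # Assumed Management API endpoint; verify against current Supabase docs
          curl -sf -X POST \
            -H "Authorization: Bearer ${{ secrets.SUPABASE_ACCESS_TOKEN }}" \
            -H "Content-Type: application/json" \
            -d "{\"branch_name\": \"pr-${{ github.event.number }}\"}" \
            "https://api.supabase.com/v1/projects/${{ secrets.SUPABASE_PROJECT_REF }}/branches"

  build-image:
    runs-on: ubuntu-latest   # runs concurrently with create-db-branch
    steps:
      - uses: actions/checkout@v4
      - name: Build image
        run: docker build -t habit-hub:pr-${{ github.event.number }} .
```

Because neither job depends on the other, the 2-5 minute branch creation is hidden behind the Docker build rather than added to it.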

Phase 5: Cleanup Automation (Week 3)

  • TTL labels on ephemeral namespaces
  • Daily CronJob for orphan detection
  • Supabase branch cleanup API calls
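
A minimal sketch of the daily orphan-detection CronJob; the ephemeral and expires-at labels and the service account are assumed conventions:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ephemeral-cleanup
  namespace: kube-system
spec:
  schedule: "0 3 * * *"   # daily at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: ephemeral-cleanup   # assumed SA with namespace delete RBAC
          restartPolicy: Never
          containers:
            - name: cleanup
              image: bitnami/kubectl:latest
              command: ["/bin/sh", "-c"]
              args:
                - |
                  now=$(date +%s)
                  # "expires-at" is an assumed epoch-seconds label set at creation time
                  kubectl get ns -l ephemeral=true \
                    -o jsonpath='{range .items[*]}{.metadata.name} {.metadata.labels.expires-at}{"\n"}{end}' |
                  while read ns expires; do
                    if [ -n "$expires" ] && [ "$expires" -lt "$now" ]; then
                      kubectl delete ns "$ns"
                    fi
                  done
```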

Resource Quotas (Per Ephemeral Environment)

apiVersion: v1
kind: ResourceQuota
metadata:
  name: ephemeral-quota
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
    persistentvolumeclaims: "2"
    services.loadbalancers: "0"  # Force shared ingress

Network Policy

Critical: NetworkPolicy must explicitly allow DNS (port 53) or pods will fail to resolve external services. This was identified as a P0 issue during LLM Council review.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ephemeral-isolation
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
  egress:
    # CRITICAL: Allow DNS resolution (P0 fix)
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # Allow external HTTPS (Supabase, APIs)
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8
              - 172.16.0.0/12
              - 192.168.0.0/16
      ports:
        - protocol: TCP
          port: 443
        - protocol: TCP
          port: 5432  # PostgreSQL/Supabase

Naming Conventions

| Resource | Pattern | Example |
|---|---|---|
| Namespace | habit-hub-{type}-{id} | habit-hub-pr-123 |
| Ingress Host | {id}.preview.habithub.dev | pr-123.preview.habithub.dev |
| Supabase Branch | {type}-{id} | pr-123 |
| Helm Release | {type}-{id} | pr-123 |
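
A deployment script can derive every name from the same type/id pair so the conventions live in one place; a minimal shell sketch (variable names are illustrative):

```shell
#!/bin/sh
# Derive all resource names for one environment from its type and id,
# following the naming convention table.
env_type="pr"
env_id="123"

namespace="habit-hub-${env_type}-${env_id}"
ingress_host="${env_type}-${env_id}.preview.habithub.dev"
supabase_branch="${env_type}-${env_id}"
helm_release="${env_type}-${env_id}"

echo "${namespace} ${ingress_host} ${helm_release}"
# → habit-hub-pr-123 pr-123.preview.habithub.dev pr-123
```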

Cost Estimation

Note: Cost estimates revised after LLM Council review. Original estimate of $280-300/mo was optimistic. Hidden costs (NAT data processing, ALB LCUs, EBS, CloudWatch) often push actual costs higher.

Base Infrastructure Costs

| Component | Monthly Cost | Notes |
|---|---|---|
| EKS Control Plane | $73 | Fixed |
| On-Demand Nodes (prod/UAT) | ~$100-120 | t3.medium x3, region-dependent |
| Spot Nodes (ephemeral) | ~$50-150 | Variable based on PR activity |
| NAT Gateway | ~$45-100 | Base $32 + data processing fees |
| ALB (shared) | ~$25-50 | Using group.name annotation |
| Supabase UAT Project | $25 | Pro plan includes branches |

Hidden Costs (Often Missed)

| Component | Monthly Cost | Notes |
|---|---|---|
| EBS Storage | ~$10-30 | PVCs for stateful components |
| CloudWatch Logs | ~$10-30 | Log ingestion and storage |
| Secrets Manager | ~$5-15 | Per-secret charges |
| Route53 | ~$5-10 | Hosted zones + DNS queries |
| NAT Data Processing | ~$10-45 | $0.045/GB for external traffic |

Cost Summary

| Scenario | Monthly Total | Notes |
|---|---|---|
| Optimistic | ~$350/mo | Low PR activity, minimal data transfer |
| Typical | ~$400-450/mo | 10-15 concurrent PRs |
| Peak | ~$500-550/mo | High PR activity, many concurrent envs |

Cost Optimization Strategies

  1. VPC Endpoints - Add for ECR/S3 to bypass NAT Gateway (saves ~$20-40/mo)
  2. ALB Sharing - Use alb.ingress.kubernetes.io/group.name annotation (critical)
  3. Single AZ for Spot - Force ephemeral nodes to one AZ (one NAT instead of three)
  4. Aggressive Scale-Down - Cluster Autoscaler with scale-down-unneeded-time reduced to 5 minutes
  5. StorageClass ReclaimPolicy - Set to Delete to prevent orphaned PVs
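
Strategy 2 reduces to one annotation on each environment's Ingress; a sketch assuming the AWS Load Balancer Controller is installed, with an illustrative service name:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: habit-hub-pr-123
  annotations:
    # Every Ingress with the same group.name shares a single ALB
    alb.ingress.kubernetes.io/group.name: habit-hub-ephemeral
    alb.ingress.kubernetes.io/scheme: internet-facing
spec:
  ingressClassName: alb
  rules:
    - host: pr-123.preview.habithub.dev
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: habit-hub-web   # assumed service name
                port:
                  number: 8080
```

Without the group annotation, each preview environment would provision its own ALB at ~$20+/mo each.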

Alternatives Considered

1. Separate Clusters Per Environment

Rejected - Cost prohibitive. Each EKS cluster costs $73/month for the control plane alone, plus node costs. With multiple ephemeral environments, this would quickly exceed the revised ~$400-500/month budget.

2. GitHub Actions Direct Deploy

Rejected - Conflicts with GitOps model. Direct deploys bypass ArgoCD, breaking the single source of truth principle and making drift detection impossible.

3. Kustomize Overlays

Rejected - Helm already exists in the project. Adding Kustomize would create tool sprawl and confusion about which tool to use when. Helm values files provide sufficient environment customization.

4. vCluster for Isolation

Deferred to Phase 2 - Virtual clusters provide stronger isolation than namespaces but add complexity. Will evaluate if namespace isolation proves insufficient for security or resource contention.

5. Neon for Database Branching

Documented as contingency - If Supabase branching proves too slow (>5 min), Neon offers instant copy-on-write branches. However, it would require database migration and new integration work.

Consequences

Positive

  • Consistent GitOps deployment model - ArgoCD manages all environments, ensuring drift detection and automated reconciliation
  • Cost-efficient with Spot instances - 60-70% compute savings on ephemeral workloads with graceful interruption handling
  • Proper security isolation - NetworkPolicies prevent cross-namespace communication, Pod Security Standards enforce least privilege
  • Self-service for developers - PR preview environments spin up automatically, no manual intervention needed
  • Automatic cleanup - TTL labels and orphan detection prevent cost leaks from forgotten environments

Negative

  • Supabase branching adds 2-5 min to environment spin-up - Mitigated by starting branch creation in parallel with Docker image build
  • OAuth2 Proxy adds authentication complexity - All preview URLs require GitHub org membership, which may slow external stakeholder reviews
  • Orphan cleanup requires monitoring - Daily CronJob may miss some cases; need alerting at 70% quota utilization
  • Spot interruptions - Ephemeral workloads may be interrupted; requires multiple instance types and graceful shutdown handling

Security Considerations

Authentication

  • Preview URLs protected by OAuth2 Proxy requiring GitHub organization membership
  • UAT uses same OAuth2 Proxy with stricter team-based access
  • A/B tests restricted to internal IP ranges

Network Isolation

  • NetworkPolicies prevent cross-namespace pod communication
  • Egress restricted to external APIs only (Supabase, external services)
  • Internal traffic blocked except for shared ingress

Pod Security

  • Pod Security Standards set to "restricted" for all ephemeral namespaces
  • Containers run as non-root with dropped capabilities
  • Read-only root filesystem where possible

Secrets Management

  • Per-environment secrets via External Secrets Operator (ESO)
  • AWS Secrets Manager as secret backend (not SSM Parameter Store)
  • IRSA (IAM Roles for Service Accounts) for secure authentication
  • ClusterSecretStore pattern for cross-namespace access
  • Secrets auto-sync every 1 hour
  • No secrets in Git, no shared secrets between environments
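
The ESO wiring can be sketched as an ExternalSecret that pulls from the ClusterSecretStore; the store name is an assumption, while the Secrets Manager path matches the structure listed below:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
  namespace: habit-hub-uat
spec:
  refreshInterval: 1h              # matches the hourly auto-sync
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager      # assumed store name
  target:
    name: app-secrets              # Kubernetes Secret created/updated by ESO
  dataFrom:
    - extract:
        key: habit-hub/uat/app-secrets   # pulls all JSON keys from this entry
```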

Implemented Secrets Structure:

habit-hub/uat/app-secrets:
  - jwt-secret-key
  - admin-jwt-secret-key
  - jwt-refresh-secret-key
  - secret-key
  - encryption-key

habit-hub/uat/supabase-secrets:
  - database-url
  - supabase-url
  - supabase-anon-key
  - supabase-service-key

Database Access

  • A/B tests use read-only database credentials
  • Row Level Security (RLS) validated before any A/B test deployment
  • PR preview branches isolated from production data

Data Seeding Strategy

| Environment Type | Data Approach |
|---|---|
| PR Preview | Minimal seed (~100 records) - just enough to demonstrate functionality |
| UAT | Anonymized production snapshot - realistic data volumes for testing |
| A/B Test | Production read-only (validated RLS) - real user data with strict access controls |
| Load Test | Synthetic generated data - high volume, no PII |

Seeding Implementation

  • PR previews use a git-committed seed SQL file (db/seeds/minimal.sql)
  • UAT snapshots created weekly via pg_dump with anonymization script
  • A/B tests connect directly to production replica with RLS-enforced views
  • Load test data generated by Faker scripts at deployment time
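
The committed seed file could look roughly like this; the table and column names are invented for illustration and are not the project's actual schema:

```sql
-- db/seeds/minimal.sql (sketch)
insert into habits (id, name, frequency)
values
  ('00000000-0000-0000-0000-000000000001', 'Morning run', 'daily'),
  ('00000000-0000-0000-0000-000000000002', 'Read 20 pages', 'daily');

-- A month of backfilled logs per habit, enough to render streaks and charts
insert into habit_logs (habit_id, logged_at)
select id, now() - (n || ' days')::interval
from habits, generate_series(1, 30) as n;
```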

Success Metrics

| Metric | Target | Measurement |
|---|---|---|
| Environment spin-up time | <10 minutes | From PR open to URL accessible |
| Teardown success rate | >99% | Environments deleted within TTL |
| Maximum concurrent PR environments | 20 | Without resource quota exhaustion |
| Cost per PR environment | <$5/day | Spot instances + shared resources |
| Orphan detection rate | 100% | No orphans older than 48 hours |

Monitoring and Alerting

Required Alerts

  1. Quota exhaustion warning - Alert at 70% namespace quota utilization
  2. Orphan environment detected - Namespace exists without matching PR
  3. Spot interruption rate - Alert if >10% of pods interrupted in 1 hour
  4. Supabase branch creation failure - Alert on any branch creation error
  5. Environment spin-up timeout - Alert if environment not ready in 15 minutes
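
Assuming Prometheus Operator and kube-state-metrics are available (neither is confirmed in this ADR), alert 1 could be expressed as:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ephemeral-quota-alerts
spec:
  groups:
    - name: ephemeral-environments
      rules:
        - alert: NamespaceQuotaNearLimit
          # kube_resourcequota exposes "used" and "hard" series per namespace/resource
          expr: |
            kube_resourcequota{type="used"}
              / on (namespace, resource) kube_resourcequota{type="hard"} > 0.70
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Namespace {{ $labels.namespace }} is above 70% of its {{ $labels.resource }} quota"
```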

Dashboards

  • Ephemeral environment count by type (Grafana)
  • Resource utilization by namespace (Grafana)
  • Cost tracking by environment (AWS Cost Explorer tags)

Related ADRs

  • ADR-001: PostgreSQL for Native Habits Storage - Establishes Supabase as the database platform
  • ADR-004: Circuit Breaker for External APIs - Applies to Habitify API calls in ephemeral environments

LLM Council Review

This ADR was reviewed by a multi-model LLM Council (GPT-5.2, Gemini 3 Pro, Claude Opus 4.5, Grok 4) on 2026-01-17.

Verdict: B+ / Approved with Improvements

Key findings addressed:

  1. Pod Security Standards (P0) - Added explicit PSS enforcement via namespace labels
  2. DNS in NetworkPolicy (P0) - NetworkPolicy now explicitly allows UDP/TCP port 53
  3. ALB Sharing (P0) - Added alb.ingress.kubernetes.io/group.name annotation
  4. Cost Estimates (P0) - Revised from $280-300/mo to realistic $350-550/mo range

Recommendations for future iterations:

  • Consider Karpenter over Cluster Autoscaler for better spot management
  • Add VPC endpoints for ECR/S3 to reduce NAT costs
  • Implement ArgoCD Sync Waves for dependency ordering
  • Add Kyverno/OPA policies for additional guardrails

Lessons Learned (Implementation)

What Worked Well

  1. External Secrets Operator with AWS Secrets Manager: GitOps-friendly secrets management that eliminated manual secret patching. Secrets auto-sync every hour.

  2. NGINX Ingress with NLB: Simple, reliable ingress setup. The NLB provides stable external access without the complexity of ALB Ingress Controller.

  3. cert-manager with Let's Encrypt: Automatic TLS certificate provisioning once DNS is configured. HTTP-01 challenges work seamlessly with NGINX Ingress.

  4. ArgoCD with auto-sync: Changes pushed to Git automatically deploy. Self-healing ensures drift is corrected.

Challenges Encountered

  1. Frontend nginx permissions: Running nginx as non-root (UID 101) required:

    • Listening on port 8080 instead of 80
    • emptyDir volumes for /var/cache/nginx and /var/run
    • Security context with runAsUser: 101
  2. Docker build context in monorepo: Frontend Dockerfile needed the full repository context for workspace dependencies, not just the app directory.

  3. React peer dependency conflicts: Monorepo with web and mobile workspaces required --legacy-peer-deps for React 18/19 compatibility.

  4. Backend environment validation: The backend only accepts specific environment values (local, development, testing, production), so UAT runs with production.

  5. IRSA trust policy precision: The OIDC provider ID must match exactly in the trust policy condition.
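
Challenge 1 translates into deployment settings roughly like the following excerpt; container, image, and volume names are illustrative:

```yaml
# Frontend deployment excerpt: nginx running as UID 101 on port 8080
spec:
  containers:
    - name: web
      image: habit-hub-frontend:latest   # assumed image name
      ports:
        - containerPort: 8080            # nginx listens on 8080, not 80
      securityContext:
        runAsNonRoot: true
        runAsUser: 101
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
      volumeMounts:
        # Writable scratch dirs nginx needs when the root filesystem is read-only
        - name: nginx-cache
          mountPath: /var/cache/nginx
        - name: nginx-run
          mountPath: /var/run
  volumes:
    - name: nginx-cache
      emptyDir: {}
    - name: nginx-run
      emptyDir: {}
```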

Key Implementation Decisions

| Decision | Rationale |
|---|---|
| AWS Secrets Manager over SSM | Better JSON support, native secret rotation, cleaner API |
| NLB over ALB | Simpler setup, lower cost, sufficient for our needs |
| ClusterSecretStore pattern | Allows secrets to be accessed from any namespace |
| Port 8080 for frontend | Required for non-root nginx operation |

Implementation Files

| File | Purpose |
|---|---|
| k8s/external-secrets/cluster-secret-store.yaml | AWS Secrets Manager connection |
| k8s/cert-manager/cluster-issuer.yaml | Let's Encrypt issuers |
| k8s/argocd-application-uat.yaml | UAT ArgoCD Application |
| helm/habit-hub/templates/external-secrets.yaml | ExternalSecret resources |
| helm/habit-hub/values-uat.yaml | UAT environment configuration |
