ADR-009: Ephemeral Environment Infrastructure
Status
Implemented (UAT environment deployed January 2026)
Implementation Summary
The UAT environment has been successfully deployed to uat.habitclan.com with the following components:
| Component | Implementation | Status |
|---|---|---|
| EKS Cluster | Kubernetes 1.34 in eu-west-1 | ✅ Running |
| Ingress | NGINX Ingress Controller via NLB | ✅ Deployed |
| TLS | cert-manager with Let's Encrypt | ✅ Configured |
| GitOps | ArgoCD with auto-sync | ✅ Active |
| Secrets | External Secrets Operator + AWS Secrets Manager | ✅ Syncing |
| Node Groups | On-demand (UAT) + Spot (PR previews) | ✅ Configured |
Decision Drivers
- Cost ceiling: ~$400-500/month for non-production infrastructure (revised after LLM Council review)
- Must mirror production deployment patterns
- Must isolate auth and data between environments
- Self-service for developers (PR-triggered)
- Automated teardown to prevent cost leaks
Context
The Habit Hub project needs the ability to:
- Spin up isolated test environments on-demand for UAT testing
- Create A/B testing environments for feature comparison
- Provide preview environments for pull requests
- Tear down environments automatically after use
Current State
- Terraform: VPC module complete, EKS/RDS modules empty
- Kubernetes: Full base manifests, empty overlays
- Helm: Chart with dev/production values
- ArgoCD: GitOps configuration for staging/production
- GitHub Actions: Build pipelines, incomplete deployment automation
- Database: Supabase (PostgreSQL) with authentication
Decision
We will implement ephemeral environments using:
- ArgoCD ApplicationSets with Pull Request Generator (GitOps consistency)
- Kubernetes Namespaces for compute isolation
- Supabase Database Branching for database isolation
- Helm with per-environment values (single source of truth, no Kustomize)
- Wildcard DNS + cert-manager for automatic SSL
- OAuth2 Proxy for preview environment authentication
Environment Types
| Type | Trigger | Lifespan | Database | Access |
|---|---|---|---|---|
| PR Preview | PR opened | Until PR closed + TTL | Supabase branch | Org members only |
| UAT | develop branch | Persistent | Dedicated Supabase project | Org members |
| A/B Test | Manual | Until experiment ends | Read-only replica | Internal |
| Load Test | Manual | Hours | Ephemeral branch | Internal |
Architecture
```
┌─────────────────────────────────────┐
│        AWS Secrets Manager          │
│ ┌─────────────┐   ┌─────────────┐   │
│ │ app-secrets │   │ supa-secrets│   │
│ └──────┬──────┘   └──────┬──────┘   │
└────────┼─────────────────┼──────────┘
         │                 │
         ▼                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                    EKS Cluster (eu-west-1)                      │
├─────────────────────────────────────────────────────────────────┤
│ ┌──────────────────────────────────────────────────────────┐    │
│ │           External Secrets Operator (IRSA)               │    │
│ └──────────────────────────────────────────────────────────┘    │
│                                                                 │
│ ┌──────────────────────────────────────────────────────────┐    │
│ │           NGINX Ingress Controller (NLB)                 │    │
│ │   uat.habitclan.com, api-uat.habitclan.com               │    │
│ │   *.preview.habithub.dev → OAuth2 Proxy                  │    │
│ └──────────────────────────────────────────────────────────┘    │
│        │                 │                 │                    │
│ ┌──────▼──────┐   ┌──────▼──────┐   ┌──────▼──────┐             │
│ │ habit-hub-  │   │ habit-hub-  │   │ habit-hub-  │             │
│ │ production  │   │ uat         │   │ pr-123      │             │
│ │ (On-Demand) │   │ (On-Demand) │   │ (Spot)      │             │
│ └──────┬──────┘   └──────┬──────┘   └──────┬──────┘             │
│        │                 │                 │                    │
│ ┌──────────────────────────────────────────────────────────┐    │
│ │              cert-manager (Let's Encrypt)                │    │
│ └──────────────────────────────────────────────────────────┘    │
│        │                 │                 │                    │
└────────┼─────────────────┼─────────────────┼────────────────────┘
         │                 │                 │
  ┌──────▼──────┐   ┌──────▼──────┐   ┌──────▼──────┐
  │  Supabase   │   │  Supabase   │   │  Supabase   │
  │ Production  │   │ UAT Project │   │  PR Branch  │
  └─────────────┘   └─────────────┘   └─────────────┘
```
Implementation
Phase 1: Terraform Modules (Week 1)
- Complete EKS module with:
- On-demand node group for production/UAT
- Spot instance node group for ephemeral (60-70% cost savings)
- Karpenter for dynamic scaling
- IAM roles for service accounts (IRSA)
- VPC with proper subnets
Phase 2: ArgoCD ApplicationSet (Week 1-2)
- PR Generator for automatic preview environments
- GitHub token secret for PR monitoring
- Automatic cleanup on PR close
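The Phase 2 wiring can be sketched as an ApplicationSet using the Pull Request Generator. This is a minimal illustration, not the final manifest: the org/repo names, the `values-preview.yaml` file, and the `image.tag` parameter are assumptions.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: habit-hub-pr-previews
  namespace: argocd
spec:
  generators:
    - pullRequest:
        github:
          owner: habit-hub            # hypothetical org name
          repo: habit-hub             # hypothetical repo name
          tokenRef:                   # GitHub token secret for PR monitoring
            secretName: github-token
            key: token
        requeueAfterSeconds: 120      # poll interval for new/closed PRs
  template:
    metadata:
      name: 'pr-{{number}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/habit-hub/habit-hub  # hypothetical URL
        targetRevision: '{{head_sha}}'
        path: helm/habit-hub
        helm:
          valueFiles:
            - values-preview.yaml     # assumed per-environment values file
          parameters:
            - name: image.tag
              value: '{{head_sha}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: 'habit-hub-pr-{{number}}'
      syncPolicy:
        automated:
          prune: true                 # deleting the Application removes the env
        syncOptions:
          - CreateNamespace=true
```

When the PR closes, the generator drops the Application and ArgoCD prunes the namespace, which gives the "automatic cleanup on PR close" behavior.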
Phase 3: Security & Access Control (Week 2)
- OAuth2 Proxy for preview authentication
- NetworkPolicies for namespace isolation
- Pod Security Standards (restricted)
- ResourceQuotas and LimitRanges per namespace
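Pod Security Standards are enforced via namespace labels. A sketch for a preview namespace (the `warn` level mirroring `enforce` is a choice, not a requirement):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: habit-hub-pr-123
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted
```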
Phase 4: Supabase Integration (Week 2-3)
- GitHub Action for branch creation (parallel with image build)
- Migration automation on new branches
- Data seeding strategy (minimal seed for PR, anonymized for UAT)
- Branch cleanup webhook
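Running branch creation in parallel with the image build can be sketched as two independent jobs in the PR workflow. The exact Supabase CLI commands and flags vary by version and are an assumption here; `supabase/setup-cli` is the official setup action.

```yaml
# .github/workflows/pr-preview.yaml (excerpt, illustrative)
on:
  pull_request:
    types: [opened, synchronize]
jobs:
  build-image:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # ... docker build & push steps ...
  create-db-branch:            # runs concurrently with build-image
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: supabase/setup-cli@v1
      - name: Create Supabase branch for this PR
        run: supabase branches create "pr-${{ github.event.number }}" --experimental
        env:
          SUPABASE_ACCESS_TOKEN: ${{ secrets.SUPABASE_ACCESS_TOKEN }}
```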
Phase 5: Cleanup Automation (Week 3)
- TTL labels on ephemeral namespaces
- Daily CronJob for orphan detection
- Supabase branch cleanup API calls
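The daily orphan sweep could look like the CronJob below. The `env-type` and `expires-at` label names are assumptions (the ADR only specifies "TTL labels"), and the `ttl-reaper` ServiceAccount would need RBAC permission to list and delete namespaces.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ephemeral-ttl-reaper
  namespace: kube-system
spec:
  schedule: "0 3 * * *"                 # daily sweep
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: ttl-reaper  # assumed SA with namespace delete rights
          restartPolicy: Never
          containers:
            - name: reaper
              image: bitnami/kubectl:latest
              command: ["/bin/sh", "-c"]
              args:
                - |
                  now=$(date +%s)
                  # List ephemeral namespaces with their expiry label (epoch seconds)
                  kubectl get ns -l env-type=ephemeral \
                    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.labels.expires-at}{"\n"}{end}' |
                  while read ns expires; do
                    if [ -n "$expires" ] && [ "$expires" -lt "$now" ]; then
                      kubectl delete ns "$ns"
                    fi
                  done
```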
Resource Quotas (Per Ephemeral Environment)
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ephemeral-quota
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
    persistentvolumeclaims: "2"
    services.loadbalancers: "0"  # Force shared ingress
```
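Phase 3 pairs each quota with a LimitRange so that containers without explicit requests/limits still count sanely against the quota. The default values below are illustrative, not tuned:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: ephemeral-limits
spec:
  limits:
    - type: Container
      default:             # applied when a container sets no limits
        cpu: 500m
        memory: 512Mi
      defaultRequest:      # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
```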
Network Policy
Critical: NetworkPolicy must explicitly allow DNS (port 53) or pods will fail to resolve external services. This was identified as a P0 issue during LLM Council review.
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ephemeral-isolation
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
  egress:
    # CRITICAL: Allow DNS resolution (P0 fix)
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # Allow external HTTPS (Supabase, APIs)
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8
              - 172.16.0.0/12
              - 192.168.0.0/16
      ports:
        - protocol: TCP
          port: 443
        - protocol: TCP
          port: 5432  # PostgreSQL/Supabase
```
Naming Conventions
| Resource | Pattern | Example |
|---|---|---|
| Namespace | habit-hub-{type}-{id} | habit-hub-pr-123 |
| Ingress Host | {id}.preview.habithub.dev | pr-123.preview.habithub.dev |
| Supabase Branch | {type}-{id} | pr-123 |
| Helm Release | {type}-{id} | pr-123 |
Cost Estimation
Note: Cost estimates revised after LLM Council review. Original estimate of $280-300/mo was optimistic. Hidden costs (NAT data processing, ALB LCUs, EBS, CloudWatch) often push actual costs higher.
Base Infrastructure Costs
| Component | Monthly Cost | Notes |
|---|---|---|
| EKS Control Plane | $73 | Fixed |
| On-Demand Nodes (prod/UAT) | ~$100-120 | t3.medium x3, region-dependent |
| Spot Nodes (ephemeral) | ~$50-150 | Variable based on PR activity |
| NAT Gateway | ~$45-100 | Base $32 + data processing fees |
| ALB (shared) | ~$25-50 | Using group.name annotation |
| Supabase UAT Project | $25 | Pro plan includes branches |
Hidden Costs (Often Missed)
| Component | Monthly Cost | Notes |
|---|---|---|
| EBS Storage | ~$10-30 | PVCs for stateful components |
| CloudWatch Logs | ~$10-30 | Log ingestion and storage |
| Secrets Manager | ~$5-15 | Per-secret charges |
| Route53 | ~$5-10 | Hosted zones + DNS queries |
| NAT Data Processing | ~$10-45 | $0.045/GB for external traffic |
Cost Summary
| Scenario | Monthly Total | Notes |
|---|---|---|
| Optimistic | ~$350/mo | Low PR activity, minimal data transfer |
| Typical | ~$400-450/mo | 10-15 concurrent PRs |
| Peak | ~$500-550/mo | High PR activity, many concurrent envs |
Cost Optimization Strategies
- VPC Endpoints - Add for ECR/S3 to bypass NAT Gateway (saves ~$20-40/mo)
- ALB Sharing - Use the `alb.ingress.kubernetes.io/group.name` annotation (critical)
- Single AZ for Spot - Force ephemeral nodes into one AZ (one NAT Gateway instead of three)
- Aggressive Scale-Down - Cluster Autoscaler with a 5-minute unneeded time
- StorageClass ReclaimPolicy - Set to `Delete` to prevent orphaned PVs
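If the ALB Ingress Controller is in play, the sharing annotation looks like the sketch below; hostnames and the backend service name are hypothetical. (The deployed UAT stack uses NGINX+NLB instead, so this applies only to an ALB-based setup.)

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: pr-123-ingress
  annotations:
    # All preview Ingresses with the same group.name share one ALB
    alb.ingress.kubernetes.io/group.name: habit-hub-previews
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb
  rules:
    - host: pr-123.preview.habithub.dev
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: habit-hub-frontend  # hypothetical service name
                port:
                  number: 8080
```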
Alternatives Considered
1. Separate Clusters Per Environment
Rejected - Cost prohibitive. Each EKS cluster costs $73/month for the control plane alone, plus node costs. With ephemeral needs, this would quickly exceed the ~$400-500/month budget.
2. GitHub Actions Direct Deploy
Rejected - Conflicts with GitOps model. Direct deploys bypass ArgoCD, breaking the single source of truth principle and making drift detection impossible.
3. Kustomize Overlays
Rejected - Helm already exists in the project. Adding Kustomize would create tool sprawl and confusion about which tool to use when. Helm values files provide sufficient environment customization.
4. vCluster for Isolation
Deferred to Phase 2 - Virtual clusters provide stronger isolation than namespaces but add complexity. Will evaluate if namespace isolation proves insufficient for security or resource contention.
5. Neon for Database Branching
Documented as contingency - If Supabase branching proves too slow (>5 min), Neon offers instant copy-on-write branches. However, it would require database migration and new integration work.
Consequences
Positive
- Consistent GitOps deployment model - ArgoCD manages all environments, ensuring drift detection and automated reconciliation
- Cost-efficient with Spot instances - 60-70% compute savings on ephemeral workloads with graceful interruption handling
- Proper security isolation - NetworkPolicies prevent cross-namespace communication, Pod Security Standards enforce least privilege
- Self-service for developers - PR preview environments spin up automatically, no manual intervention needed
- Automatic cleanup - TTL labels and orphan detection prevent cost leaks from forgotten environments
Negative
- Supabase branching adds 2-5 min to environment spin-up - Mitigated by starting branch creation in parallel with Docker image build
- OAuth2 Proxy adds authentication complexity - All preview URLs require GitHub org membership, which may slow external stakeholder reviews
- Orphan cleanup requires monitoring - Daily CronJob may miss some cases; need alerting at 70% quota utilization
- Spot interruptions - Ephemeral workloads may be interrupted; requires multiple instance types and graceful shutdown handling
Security Considerations
Authentication
- Preview URLs protected by OAuth2 Proxy requiring GitHub organization membership
- UAT uses same OAuth2 Proxy with stricter team-based access
- A/B tests restricted to internal IP ranges
Network Isolation
- NetworkPolicies prevent cross-namespace pod communication
- Egress restricted to external APIs only (Supabase, external services)
- Internal traffic blocked except for shared ingress
Pod Security
- Pod Security Standards set to "restricted" for all ephemeral namespaces
- Containers run as non-root with dropped capabilities
- Read-only root filesystem where possible
Secrets Management
- Per-environment secrets via External Secrets Operator (ESO)
- AWS Secrets Manager as secret backend (not SSM Parameter Store)
- IRSA (IAM Roles for Service Accounts) for secure authentication
- ClusterSecretStore pattern for cross-namespace access
- Secrets auto-sync every 1 hour
- No secrets in Git, no shared secrets between environments
Implemented Secrets Structure:
```
habit-hub/uat/app-secrets:
  - jwt-secret-key
  - admin-jwt-secret-key
  - jwt-refresh-secret-key
  - secret-key
  - encryption-key

habit-hub/uat/supabase-secrets:
  - database-url
  - supabase-url
  - supabase-anon-key
  - supabase-service-key
```
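An ExternalSecret pulling the app-secrets entry into the UAT namespace could look like this; the ClusterSecretStore name is an assumption:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
  namespace: habit-hub-uat
spec:
  refreshInterval: 1h             # matches the hourly auto-sync noted above
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager     # assumed store name
  target:
    name: app-secrets             # resulting Kubernetes Secret
  dataFrom:
    - extract:
        key: habit-hub/uat/app-secrets  # pulls all keys from the JSON secret
```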
Database Access
- A/B tests use read-only database credentials
- Row Level Security (RLS) validated before any A/B test deployment
- PR preview branches isolated from production data
Data Seeding Strategy
| Environment Type | Data Approach |
|---|---|
| PR Preview | Minimal seed (~100 records) - just enough to demonstrate functionality |
| UAT | Anonymized production snapshot - realistic data volumes for testing |
| A/B Test | Production read-only (validated RLS) - real user data with strict access controls |
| Load Test | Synthetic generated data - high volume, no PII |
Seeding Implementation
- PR previews use a git-committed seed SQL file (`db/seeds/minimal.sql`)
- UAT snapshots created weekly via pg_dump with an anonymization script
- A/B tests connect directly to a production replica with RLS-enforced views
- Load test data generated by Faker scripts at deployment time
Success Metrics
| Metric | Target | Measurement |
|---|---|---|
| Environment spin-up time | <10 minutes | From PR open to URL accessible |
| Teardown success rate | >99% | Environments deleted within TTL |
| Maximum concurrent PR environments | 20 | Without resource quota exhaustion |
| Cost per PR environment | <$5/day | Spot instances + shared resources |
| Orphan detection rate | 100% | No orphans older than 48 hours |
Monitoring and Alerting
Required Alerts
- Quota exhaustion warning - Alert at 70% namespace quota utilization
- Orphan environment detected - Namespace exists without matching PR
- Spot interruption rate - Alert if >10% of pods interrupted in 1 hour
- Supabase branch creation failure - Alert on any branch creation error
- Environment spin-up timeout - Alert if environment not ready in 15 minutes
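Assuming kube-state-metrics and the Prometheus Operator are running, the quota-exhaustion alert could be expressed as a PrometheusRule along these lines:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ephemeral-env-alerts
spec:
  groups:
    - name: ephemeral
      rules:
        - alert: NamespaceQuotaNearLimit
          # Fires when any quota'd resource exceeds 70% of its hard limit
          expr: |
            kube_resourcequota{type="used"}
              / ignoring(type) kube_resourcequota{type="hard"} > 0.70
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Namespace {{ $labels.namespace }} above 70% of quota for {{ $labels.resource }}"
```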
Dashboards
- Ephemeral environment count by type (Grafana)
- Resource utilization by namespace (Grafana)
- Cost tracking by environment (AWS Cost Explorer tags)
Related ADRs
- ADR-001: PostgreSQL for Native Habits Storage - Establishes Supabase as the database platform
- ADR-004: Circuit Breaker for External APIs - Applies to Habitify API calls in ephemeral environments
LLM Council Review
This ADR was reviewed by a multi-model LLM Council (GPT-5.2, Gemini 3 Pro, Claude Opus 4.5, Grok 4) on 2026-01-17.
Verdict: B+ / Approved with Improvements
Key findings addressed:
- Pod Security Standards (P0) - Added explicit PSS enforcement via namespace labels
- DNS in NetworkPolicy (P0) - NetworkPolicy now explicitly allows UDP/TCP port 53
- ALB Sharing (P0) - Added
alb.ingress.kubernetes.io/group.nameannotation - Cost Estimates (P0) - Revised from $280-300/mo to realistic $350-550/mo range
Recommendations for future iterations:
- Consider Karpenter over Cluster Autoscaler for better spot management
- Add VPC endpoints for ECR/S3 to reduce NAT costs
- Implement ArgoCD Sync Waves for dependency ordering
- Add Kyverno/OPA policies for additional guardrails
Lessons Learned (Implementation)
What Worked Well
- External Secrets Operator with AWS Secrets Manager: GitOps-friendly secrets management that eliminated manual secret patching. Secrets auto-sync every hour.
- NGINX Ingress with NLB: Simple, reliable ingress setup. The NLB provides stable external access without the complexity of the ALB Ingress Controller.
- cert-manager with Let's Encrypt: Automatic TLS certificate provisioning once DNS is configured. HTTP-01 challenges work seamlessly with NGINX Ingress.
- ArgoCD with auto-sync: Changes pushed to Git deploy automatically. Self-healing ensures drift is corrected.
Challenges Encountered
- Frontend nginx permissions: Running nginx as non-root (UID 101) required:
  - Listening on port 8080 instead of 80
  - emptyDir volumes for `/var/cache/nginx` and `/var/run`
  - Security context with `runAsUser: 101`
- Docker build context in monorepo: Frontend Dockerfile needed the full repository context for workspace dependencies, not just the app directory.
- React peer dependency conflicts: Monorepo with web and mobile workspaces required `--legacy-peer-deps` for React 18/19 compatibility.
- Backend environment validation: The backend only accepts specific environment values (`local`, `development`, `testing`, `production`). UAT uses `production`.
- IRSA trust policy precision: The OIDC provider ID must match exactly in the trust policy condition.
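The non-root nginx fixes above translate into a Deployment excerpt roughly like this; the container image name is hypothetical:

```yaml
# Frontend Deployment excerpt reflecting the non-root nginx fixes (illustrative)
spec:
  template:
    spec:
      securityContext:
        runAsUser: 101          # nginx user in the unprivileged image
        runAsNonRoot: true
      containers:
        - name: frontend
          image: habit-hub/frontend:latest  # hypothetical image name
          ports:
            - containerPort: 8080           # non-root nginx cannot bind port 80
          volumeMounts:
            - name: nginx-cache
              mountPath: /var/cache/nginx
            - name: nginx-run
              mountPath: /var/run
      volumes:
        - name: nginx-cache
          emptyDir: {}
        - name: nginx-run
          emptyDir: {}
```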
Key Implementation Decisions
| Decision | Rationale |
|---|---|
| AWS Secrets Manager over SSM | Better JSON support, native secret rotation, cleaner API |
| NLB over ALB | Simpler setup, lower cost, sufficient for our needs |
| ClusterSecretStore pattern | Allows secrets to be accessed from any namespace |
| Port 8080 for frontend | Required for non-root nginx operation |
Implementation Files
| File | Purpose |
|---|---|
| `k8s/external-secrets/cluster-secret-store.yaml` | AWS Secrets Manager connection |
| `k8s/cert-manager/cluster-issuer.yaml` | Let's Encrypt issuers |
| `k8s/argocd-application-uat.yaml` | UAT ArgoCD Application |
| `helm/habit-hub/templates/external-secrets.yaml` | ExternalSecret resources |
| `helm/habit-hub/values-uat.yaml` | UAT environment configuration |