
ADR-009: Ephemeral Environment Infrastructure

Status

Implemented (UAT environment deployed January 2026)

Implementation Summary

The UAT environment has been successfully deployed to uat.habitclan.com with the following components:

| Component | Implementation | Status |
|---|---|---|
| EKS Cluster | Kubernetes 1.34 in eu-west-1 | ✅ Running |
| Ingress | NGINX Ingress Controller via NLB | ✅ Deployed |
| TLS | cert-manager with Let's Encrypt | ✅ Configured |
| GitOps | ArgoCD with auto-sync | ✅ Active |
| Secrets | External Secrets Operator + AWS Secrets Manager | ✅ Syncing |
| Node Groups | On-demand (UAT) + Spot (PR previews) | ✅ Configured |

Decision Drivers

  • Cost ceiling: ~$400-500/month for non-production infrastructure (revised after LLM Council review)
  • Must mirror production deployment patterns
  • Must isolate auth and data between environments
  • Self-service for developers (PR-triggered)
  • Automated teardown to prevent cost leaks

Context

The Habit Hub project needs the ability to:

  1. Spin up isolated test environments on-demand for UAT testing
  2. Create A/B testing environments for feature comparison
  3. Provide preview environments for pull requests
  4. Tear down environments automatically after use

Current State

  • Terraform: VPC module complete, EKS/RDS modules empty
  • Kubernetes: Full base manifests, empty overlays
  • Helm: Chart with dev/production values
  • ArgoCD: GitOps configuration for staging/production
  • GitHub Actions: Build pipelines, incomplete deployment automation
  • Database: Supabase (PostgreSQL) with authentication

Decision

We will implement ephemeral environments using:

  1. ArgoCD ApplicationSets with Pull Request Generator (GitOps consistency)
  2. Kubernetes Namespaces for compute isolation
  3. Supabase Database Branching for database isolation
  4. Helm with per-environment values (single source of truth, no Kustomize)
  5. Wildcard DNS + cert-manager for automatic SSL
  6. OAuth2 Proxy for preview environment authentication
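
Decision 1 can be sketched as an ApplicationSet using ArgoCD's Pull Request Generator. The repository URL, chart path, and values file below are illustrative assumptions, not the project's actual configuration:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: habit-hub-pr-previews
  namespace: argocd
spec:
  generators:
    - pullRequest:
        github:
          owner: habit-hub             # assumed org name
          repo: habit-hub              # assumed repo name
          tokenRef:
            secretName: github-token
            key: token
        requeueAfterSeconds: 120       # poll GitHub every 2 minutes
  template:
    metadata:
      name: "habit-hub-pr-{{number}}"
    spec:
      project: default
      source:
        repoURL: https://github.com/habit-hub/habit-hub.git  # assumed
        targetRevision: "{{head_sha}}"
        path: helm/habit-hub
        helm:
          valueFiles:
            - values-preview.yaml      # assumed per-type values file
          parameters:
            - name: ingress.host
              value: "pr-{{number}}.preview.habithub.dev"
      destination:
        server: https://kubernetes.default.svc
        namespace: "habit-hub-pr-{{number}}"
      syncPolicy:
        automated:
          prune: true                  # removes the environment when the PR closes
        syncOptions:
          - CreateNamespace=true
```

When the PR closes, the generator drops the entry and ArgoCD prunes the generated Application, which is what makes teardown automatic.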

Environment Types

| Type | Trigger | Lifespan | Database | Access |
|---|---|---|---|---|
| PR Preview | PR opened | Until PR closed + TTL | Supabase branch | Org members only |
| UAT | develop branch | Persistent | Dedicated Supabase project | Org members |
| A/B Test | Manual | Until experiment ends | Read-only replica | Internal |
| Load Test | Manual | Hours | Ephemeral branch | Internal |

Architecture

┌─────────────────────────────────────┐
│         AWS Secrets Manager         │
│  ┌─────────────┐   ┌─────────────┐  │
│  │ app-secrets │   │ supa-secrets│  │
│  └──────┬──────┘   └──────┬──────┘  │
└─────────┼─────────────────┼─────────┘
          │                 │
          ▼                 ▼
┌─────────────────────────────────────────────────────────┐
│                 EKS Cluster (eu-west-1)                 │
├─────────────────────────────────────────────────────────┤
│   ┌─────────────────────────────────────────────────┐   │
│   │        External Secrets Operator (IRSA)         │   │
│   └─────────────────────────────────────────────────┘   │
│                                                         │
│   ┌─────────────────────────────────────────────────┐   │
│   │         NGINX Ingress Controller (NLB)          │   │
│   │    uat.habitclan.com, api-uat.habitclan.com     │   │
│   │      *.preview.habithub.dev → OAuth2 Proxy      │   │
│   └─────────────────────────────────────────────────┘   │
│          │                 │                 │          │
│   ┌──────▼──────┐   ┌──────▼──────┐   ┌──────▼──────┐   │
│   │ habit-hub-  │   │ habit-hub-  │   │ habit-hub-  │   │
│   │ production  │   │     uat     │   │   pr-123    │   │
│   │ (On-Demand) │   │ (On-Demand) │   │   (Spot)    │   │
│   └──────┬──────┘   └──────┬──────┘   └──────┬──────┘   │
│          │                 │                 │          │
│   ┌─────────────────────────────────────────────────┐   │
│   │          cert-manager (Let's Encrypt)           │   │
│   └─────────────────────────────────────────────────┘   │
└──────────┼─────────────────┼─────────────────┼──────────┘
           │                 │                 │
    ┌──────▼──────┐   ┌──────▼──────┐   ┌──────▼──────┐
    │  Supabase   │   │  Supabase   │   │  Supabase   │
    │ Production  │   │ UAT Project │   │  PR Branch  │
    └─────────────┘   └─────────────┘   └─────────────┘

Implementation

Phase 1: Terraform Modules (Week 1)

  • Complete EKS module with:
    • On-demand node group for production/UAT
    • Spot instance node group for ephemeral (60-70% cost savings)
    • Karpenter for dynamic scaling
  • IAM roles for service accounts (IRSA)
  • VPC with proper subnets

Phase 2: ArgoCD ApplicationSet (Week 1-2)

  • PR Generator for automatic preview environments
  • GitHub token secret for PR monitoring
  • Automatic cleanup on PR close

Phase 3: Security & Access Control (Week 2)

  • OAuth2 Proxy for preview authentication
  • NetworkPolicies for namespace isolation
  • Pod Security Standards (restricted)
  • ResourceQuotas and LimitRanges per namespace
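
The restricted Pod Security Standard is enforced with the standard namespace labels; a minimal sketch (the environment-type label is an assumed convention for the cleanup automation, not confirmed in this ADR):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: habit-hub-pr-123
  labels:
    # Pod Security Standards: reject, log, and warn on any non-restricted pod spec
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
    environment-type: preview   # assumed label consumed by TTL cleanup
```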

Phase 4: Supabase Integration (Week 2-3)

  • GitHub Action for branch creation (parallel with image build)
  • Migration automation on new branches
  • Data seeding strategy (minimal seed for PR, anonymized for UAT)
  • Branch cleanup webhook
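
The branch-creation job can run as its own workflow job so it overlaps with the image build. The Supabase Management API call and secret names below are assumptions sketching the shape, not verified project configuration:

```yaml
# .github/workflows/preview.yaml (sketch)
jobs:
  create-db-branch:
    runs-on: ubuntu-latest
    steps:
      - name: Create Supabase branch for this PR
        run: |
          # Assumed Management API endpoint; verify against current Supabase docs
          curl -sf -X POST \
            -H "Authorization: Bearer ${{ secrets.SUPABASE_ACCESS_TOKEN }}" \
            -H "Content-Type: application/json" \
            -d "{\"branch_name\": \"pr-${{ github.event.number }}\"}" \
            "https://api.supabase.com/v1/projects/${{ secrets.SUPABASE_PROJECT_REF }}/branches"

  build-image:
    runs-on: ubuntu-latest   # runs concurrently with create-db-branch
    steps:
      - uses: actions/checkout@v4
      - name: Build image
        run: docker build -t habit-hub:pr-${{ github.event.number }} .
```

Because neither job depends on the other, the 2-5 minute branch creation is hidden behind the Docker build rather than added to it.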

Phase 5: Cleanup Automation (Week 3)

  • TTL labels on ephemeral namespaces
  • Daily CronJob for orphan detection
  • Supabase branch cleanup API calls
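
A minimal sketch of the daily orphan-detection CronJob; the ephemeral and expires-at labels and the service account are assumed conventions:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ephemeral-cleanup
  namespace: kube-system
spec:
  schedule: "0 3 * * *"   # daily at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: ephemeral-cleanup   # assumed SA with namespace delete RBAC
          restartPolicy: Never
          containers:
            - name: cleanup
              image: bitnami/kubectl:latest
              command: ["/bin/sh", "-c"]
              args:
                - |
                  now=$(date +%s)
                  # "expires-at" is an assumed epoch-seconds label set at creation time
                  kubectl get ns -l ephemeral=true \
                    -o jsonpath='{range .items[*]}{.metadata.name} {.metadata.labels.expires-at}{"\n"}{end}' |
                  while read ns expires; do
                    if [ -n "$expires" ] && [ "$expires" -lt "$now" ]; then
                      kubectl delete ns "$ns"
                    fi
                  done
```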

Resource Quotas (Per Ephemeral Environment)

apiVersion: v1
kind: ResourceQuota
metadata:
  name: ephemeral-quota
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
    persistentvolumeclaims: "2"
    services.loadbalancers: "0"  # Force shared ingress

Network Policy

Critical: NetworkPolicy must explicitly allow DNS (port 53) or pods will fail to resolve external services. This was identified as a P0 issue during LLM Council review.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ephemeral-isolation
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
  egress:
    # CRITICAL: Allow DNS resolution (P0 fix)
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # Allow external HTTPS (Supabase, APIs)
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8
              - 172.16.0.0/12
              - 192.168.0.0/16
      ports:
        - protocol: TCP
          port: 443
        - protocol: TCP
          port: 5432  # PostgreSQL/Supabase

Naming Conventions

| Resource | Pattern | Example |
|---|---|---|
| Namespace | habit-hub-{type}-{id} | habit-hub-pr-123 |
| Ingress Host | {id}.preview.habithub.dev | pr-123.preview.habithub.dev |
| Supabase Branch | {type}-{id} | pr-123 |
| Helm Release | {type}-{id} | pr-123 |
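
A deployment script can derive every name from the same type/id pair so the conventions live in one place; a minimal shell sketch (variable names are illustrative):

```shell
#!/bin/sh
# Derive all resource names for one environment from its type and id,
# following the naming convention table.
env_type="pr"
env_id="123"

namespace="habit-hub-${env_type}-${env_id}"
ingress_host="${env_type}-${env_id}.preview.habithub.dev"
supabase_branch="${env_type}-${env_id}"
helm_release="${env_type}-${env_id}"

echo "${namespace} ${ingress_host} ${helm_release}"
# → habit-hub-pr-123 pr-123.preview.habithub.dev pr-123
```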

Cost Estimation

Note: Cost estimates revised after LLM Council review. Original estimate of $280-300/mo was optimistic. Hidden costs (NAT data processing, ALB LCUs, EBS, CloudWatch) often push actual costs higher.

Base Infrastructure Costs

| Component | Monthly Cost | Notes |
|---|---|---|
| EKS Control Plane | $73 | Fixed |
| On-Demand Nodes (prod/UAT) | ~$100-120 | t3.medium x3, region-dependent |
| Spot Nodes (ephemeral) | ~$50-150 | Variable based on PR activity |
| NAT Gateway | ~$45-100 | Base $32 + data processing fees |
| ALB (shared) | ~$25-50 | Using group.name annotation |
| Supabase UAT Project | $25 | Pro plan includes branches |

Hidden Costs (Often Missed)

| Component | Monthly Cost | Notes |
|---|---|---|
| EBS Storage | ~$10-30 | PVCs for stateful components |
| CloudWatch Logs | ~$10-30 | Log ingestion and storage |
| Secrets Manager | ~$5-15 | Per-secret charges |
| Route53 | ~$5-10 | Hosted zones + DNS queries |
| NAT Data Processing | ~$10-45 | $0.045/GB for external traffic |

Cost Summary

| Scenario | Monthly Total | Notes |
|---|---|---|
| Optimistic | ~$350/mo | Low PR activity, minimal data transfer |
| Typical | ~$400-450/mo | 10-15 concurrent PRs |
| Peak | ~$500-550/mo | High PR activity, many concurrent envs |

Cost Optimization Strategies

  1. VPC Endpoints - Add for ECR/S3 to bypass NAT Gateway (saves ~$20-40/mo)
  2. ALB Sharing - Use alb.ingress.kubernetes.io/group.name annotation (critical)
  3. Single AZ for Spot - Force ephemeral nodes to one AZ (one NAT instead of three)
  4. Aggressive Scale-Down - Cluster Autoscaler with scale-down-unneeded-time reduced to 5 minutes
  5. StorageClass ReclaimPolicy - Set to Delete to prevent orphaned PVs
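
Strategy 2 reduces to one annotation on each environment's Ingress; a sketch assuming the AWS Load Balancer Controller is installed, with an illustrative service name:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: habit-hub-pr-123
  annotations:
    # Every Ingress with the same group.name shares a single ALB
    alb.ingress.kubernetes.io/group.name: habit-hub-ephemeral
    alb.ingress.kubernetes.io/scheme: internet-facing
spec:
  ingressClassName: alb
  rules:
    - host: pr-123.preview.habithub.dev
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: habit-hub-web   # assumed service name
                port:
                  number: 8080
```

Without the group annotation, each preview environment would provision its own ALB at ~$20+/mo each.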

Alternatives Considered

1. Separate Clusters Per Environment

Rejected - Cost prohibitive. Each EKS cluster costs $73/month for the control plane alone, plus node costs. With multiple ephemeral environments, this would quickly exceed the revised ~$400-500/month budget.

2. GitHub Actions Direct Deploy

Rejected - Conflicts with GitOps model. Direct deploys bypass ArgoCD, breaking the single source of truth principle and making drift detection impossible.

3. Kustomize Overlays

Rejected - Helm already exists in the project. Adding Kustomize would create tool sprawl and confusion about which tool to use when. Helm values files provide sufficient environment customization.

4. vCluster for Isolation

Deferred to Phase 2 - Virtual clusters provide stronger isolation than namespaces but add complexity. Will evaluate if namespace isolation proves insufficient for security or resource contention.

5. Neon for Database Branching

Documented as contingency - If Supabase branching proves too slow (>5 min), Neon offers instant copy-on-write branches. However, it would require database migration and new integration work.

Consequences

Positive

  • Consistent GitOps deployment model - ArgoCD manages all environments, ensuring drift detection and automated reconciliation
  • Cost-efficient with Spot instances - 60-70% compute savings on ephemeral workloads with graceful interruption handling
  • Proper security isolation - NetworkPolicies prevent cross-namespace communication, Pod Security Standards enforce least privilege
  • Self-service for developers - PR preview environments spin up automatically, no manual intervention needed
  • Automatic cleanup - TTL labels and orphan detection prevent cost leaks from forgotten environments

Negative

  • Supabase branching adds 2-5 min to environment spin-up - Mitigated by starting branch creation in parallel with Docker image build
  • OAuth2 Proxy adds authentication complexity - All preview URLs require GitHub org membership, which may slow external stakeholder reviews
  • Orphan cleanup requires monitoring - Daily CronJob may miss some cases; need alerting at 70% quota utilization
  • Spot interruptions - Ephemeral workloads may be interrupted; requires multiple instance types and graceful shutdown handling

Security Considerations

Authentication

  • Preview URLs protected by OAuth2 Proxy requiring GitHub organization membership
  • UAT uses same OAuth2 Proxy with stricter team-based access
  • A/B tests restricted to internal IP ranges

Network Isolation

  • NetworkPolicies prevent cross-namespace pod communication
  • Egress restricted to external APIs only (Supabase, external services)
  • Internal traffic blocked except for shared ingress

Pod Security

  • Pod Security Standards set to "restricted" for all ephemeral namespaces
  • Containers run as non-root with dropped capabilities
  • Read-only root filesystem where possible

Secrets Management

  • Per-environment secrets via External Secrets Operator (ESO)
  • AWS Secrets Manager as secret backend (not SSM Parameter Store)
  • IRSA (IAM Roles for Service Accounts) for secure authentication
  • ClusterSecretStore pattern for cross-namespace access
  • Secrets auto-sync every 1 hour
  • No secrets in Git, no shared secrets between environments
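
The ESO wiring can be sketched as an ExternalSecret that pulls from the ClusterSecretStore; the store name is an assumption, while the Secrets Manager path matches the structure listed below:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
  namespace: habit-hub-uat
spec:
  refreshInterval: 1h              # matches the hourly auto-sync
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager      # assumed store name
  target:
    name: app-secrets              # Kubernetes Secret created/updated by ESO
  dataFrom:
    - extract:
        key: habit-hub/uat/app-secrets   # pulls all JSON keys from this entry
```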

Implemented Secrets Structure:

habit-hub/uat/app-secrets:
  - jwt-secret-key
  - admin-jwt-secret-key
  - jwt-refresh-secret-key
  - secret-key
  - encryption-key

habit-hub/uat/supabase-secrets:
  - database-url
  - supabase-url
  - supabase-anon-key
  - supabase-service-key

Database Access

  • A/B tests use read-only database credentials
  • Row Level Security (RLS) validated before any A/B test deployment
  • PR preview branches isolated from production data

Data Seeding Strategy

| Environment Type | Data Approach |
|---|---|
| PR Preview | Minimal seed (~100 records) - just enough to demonstrate functionality |
| UAT | Anonymized production snapshot - realistic data volumes for testing |
| A/B Test | Production read-only (validated RLS) - real user data with strict access controls |
| Load Test | Synthetic generated data - high volume, no PII |

Seeding Implementation

  • PR previews use a git-committed seed SQL file (db/seeds/minimal.sql)
  • UAT snapshots created weekly via pg_dump with anonymization script
  • A/B tests connect directly to production replica with RLS-enforced views
  • Load test data generated by Faker scripts at deployment time
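
The committed seed file could look roughly like this; the table and column names are invented for illustration and are not the project's actual schema:

```sql
-- db/seeds/minimal.sql (sketch)
insert into habits (id, name, frequency)
values
  ('00000000-0000-0000-0000-000000000001', 'Morning run', 'daily'),
  ('00000000-0000-0000-0000-000000000002', 'Read 20 pages', 'daily');

-- A month of backfilled logs per habit, enough to render streaks and charts
insert into habit_logs (habit_id, logged_at)
select id, now() - (n || ' days')::interval
from habits, generate_series(1, 30) as n;
```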

Success Metrics

| Metric | Target | Measurement |
|---|---|---|
| Environment spin-up time | <10 minutes | From PR open to URL accessible |
| Teardown success rate | >99% | Environments deleted within TTL |
| Maximum concurrent PR environments | 20 | Without resource quota exhaustion |
| Cost per PR environment | <$5/day | Spot instances + shared resources |
| Orphan detection rate | 100% | No orphans older than 48 hours |

Monitoring and Alerting

Required Alerts

  1. Quota exhaustion warning - Alert at 70% namespace quota utilization
  2. Orphan environment detected - Namespace exists without matching PR
  3. Spot interruption rate - Alert if >10% of pods interrupted in 1 hour
  4. Supabase branch creation failure - Alert on any branch creation error
  5. Environment spin-up timeout - Alert if environment not ready in 15 minutes
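
Assuming Prometheus Operator and kube-state-metrics are available (neither is confirmed in this ADR), alert 1 could be expressed as:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ephemeral-quota-alerts
spec:
  groups:
    - name: ephemeral-environments
      rules:
        - alert: NamespaceQuotaNearLimit
          # kube_resourcequota exposes "used" and "hard" series per namespace/resource
          expr: |
            kube_resourcequota{type="used"}
              / on (namespace, resource) kube_resourcequota{type="hard"} > 0.70
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Namespace {{ $labels.namespace }} is above 70% of its {{ $labels.resource }} quota"
```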

Dashboards

  • Ephemeral environment count by type (Grafana)
  • Resource utilization by namespace (Grafana)
  • Cost tracking by environment (AWS Cost Explorer tags)

Related ADRs

  • ADR-001: PostgreSQL for Native Habits Storage - Establishes Supabase as the database platform
  • ADR-004: Circuit Breaker for External APIs - Applies to Habitify API calls in ephemeral environments

LLM Council Review

This ADR was reviewed by a multi-model LLM Council (GPT-5.2, Gemini 3 Pro, Claude Opus 4.5, Grok 4) on 2026-01-17.

Verdict: B+ / Approved with Improvements

Key findings addressed:

  1. Pod Security Standards (P0) - Added explicit PSS enforcement via namespace labels
  2. DNS in NetworkPolicy (P0) - NetworkPolicy now explicitly allows UDP/TCP port 53
  3. ALB Sharing (P0) - Added alb.ingress.kubernetes.io/group.name annotation
  4. Cost Estimates (P0) - Revised from $280-300/mo to realistic $350-550/mo range

Recommendations for future iterations:

  • Consider Karpenter over Cluster Autoscaler for better spot management
  • Add VPC endpoints for ECR/S3 to reduce NAT costs
  • Implement ArgoCD Sync Waves for dependency ordering
  • Add Kyverno/OPA policies for additional guardrails

Lessons Learned (Implementation)

What Worked Well

  1. External Secrets Operator with AWS Secrets Manager: GitOps-friendly secrets management that eliminated manual secret patching. Secrets auto-sync every hour.

  2. NGINX Ingress with NLB: Simple, reliable ingress setup. The NLB provides stable external access without the complexity of ALB Ingress Controller.

  3. cert-manager with Let's Encrypt: Automatic TLS certificate provisioning once DNS is configured. HTTP-01 challenges work seamlessly with NGINX Ingress.

  4. ArgoCD with auto-sync: Changes pushed to Git automatically deploy. Self-healing ensures drift is corrected.

Challenges Encountered

  1. Frontend nginx permissions: Running nginx as non-root (UID 101) required:

    • Listening on port 8080 instead of 80
    • emptyDir volumes for /var/cache/nginx and /var/run
    • Security context with runAsUser: 101
  2. Docker build context in monorepo: Frontend Dockerfile needed the full repository context for workspace dependencies, not just the app directory.

  3. React peer dependency conflicts: Monorepo with web and mobile workspaces required --legacy-peer-deps for React 18/19 compatibility.

  4. Backend environment validation: The backend only accepts specific environment values (local, development, testing, production), so UAT runs with production.

  5. IRSA trust policy precision: The OIDC provider ID must match exactly in the trust policy condition.
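
Challenge 1 translates into deployment settings roughly like the following excerpt; container, image, and volume names are illustrative:

```yaml
# Frontend deployment excerpt: nginx running as UID 101 on port 8080
spec:
  containers:
    - name: web
      image: habit-hub-frontend:latest   # assumed image name
      ports:
        - containerPort: 8080            # nginx listens on 8080, not 80
      securityContext:
        runAsNonRoot: true
        runAsUser: 101
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
      volumeMounts:
        # Writable scratch dirs nginx needs when the root filesystem is read-only
        - name: nginx-cache
          mountPath: /var/cache/nginx
        - name: nginx-run
          mountPath: /var/run
  volumes:
    - name: nginx-cache
      emptyDir: {}
    - name: nginx-run
      emptyDir: {}
```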

Key Implementation Decisions

| Decision | Rationale |
|---|---|
| AWS Secrets Manager over SSM | Better JSON support, native secret rotation, cleaner API |
| NLB over ALB | Simpler setup, lower cost, sufficient for our needs |
| ClusterSecretStore pattern | Allows secrets to be accessed from any namespace |
| Port 8080 for frontend | Required for non-root nginx operation |

Implementation Files

| File | Purpose |
|---|---|
| k8s/external-secrets/cluster-secret-store.yaml | AWS Secrets Manager connection |
| k8s/cert-manager/cluster-issuer.yaml | Let's Encrypt issuers |
| k8s/argocd-application-uat.yaml | UAT ArgoCD Application |
| helm/habit-hub/templates/external-secrets.yaml | ExternalSecret resources |
| helm/habit-hub/values-uat.yaml | UAT environment configuration |
