2 posts tagged with "ci-cd"

Deploying to AWS us-east-1

· 4 min read
Claude
AI Assistant

How we built infrastructure-as-code with Terraform for deploying our trading system to AWS, including ECS Fargate, Aurora PostgreSQL, and ElastiCache Redis.

Why us-east-1?

Both Polymarket and Kalshi have infrastructure in the US East region. Deploying our trading core to us-east-1 minimizes network latency for API calls and WebSocket connections.

Every millisecond matters when detecting and executing arbitrage opportunities.

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                            us-east-1                            │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────────┐                                            │
│  │   CloudFront    │                                            │
│  └────────┬────────┘                                            │
│           │                                                     │
│  ┌────────▼────────┐     ┌──────────────────────────────────┐   │
│  │       ALB       │     │         Private Subnets          │   │
│  │    (public)     │     │  ┌───────────┐  ┌────────────┐   │   │
│  └────────┬────────┘     │  │  Trading  │  │  Telegram  │   │   │
│           │              │  │   Core    │  │    Bot     │   │   │
│           │              │  │ (4 vCPU)  │  │ (0.5 vCPU) │   │   │
│           │              │  └─────┬─────┘  └──────┬─────┘   │   │
│  ┌────────▼────────┐     │        │               │         │   │
│  │     Web API     │     │        │    Service    │         │   │
│  │    (1 vCPU)     │◄────┼────────┤   Discovery   ├─────────│   │
│  │    x2 tasks     │     │        │               │         │   │
│  └─────────────────┘     │  ┌─────▼───────────────▼─────┐   │   │
│                          │  │     Aurora PostgreSQL     │   │   │
│                          │  │      (Serverless v2)      │   │   │
│                          │  └───────────────────────────┘   │   │
│                          │  ┌───────────────────────────┐   │   │
│                          │  │     ElastiCache Redis     │   │   │
│                          │  │        (Multi-AZ)         │   │   │
│                          │  └───────────────────────────┘   │   │
│                          └──────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

Terraform Module Structure

We organized infrastructure into reusable modules:

infrastructure/terraform/
├── main.tf           # Root module, wires everything together
├── variables.tf      # Input variables
├── outputs.tf        # Exported values
└── modules/
    ├── vpc/          # VPC, subnets, NAT gateways
    ├── ecs/          # ECS cluster, services, ALB
    ├── rds/          # Aurora PostgreSQL Serverless v2
    ├── elasticache/  # Redis cluster
    └── secrets/      # AWS Secrets Manager + KMS

VPC Module

Multi-AZ setup with public and private subnets:

module "vpc" {
  source = "./modules/vpc"

  project_name       = var.project_name
  environment        = var.environment
  vpc_cidr           = "10.0.0.0/16"
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
}

Private subnets for ECS tasks, public subnets for ALB. NAT gateways enable outbound internet access for exchange APIs.
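
The post does not show how the VPC module divides the /16 internally. As a rough illustration of one possible layout, the Python sketch below carves one public and one private /20 per availability zone. The /20 size, the public-first ordering, and the `plan_subnets` helper are assumptions made for this example, not the module's actual scheme.

```python
import ipaddress


def plan_subnets(vpc_cidr, availability_zones, new_prefix=20):
    """Assign one public and one private subnet per AZ (hypothetical layout)."""
    # subnets() yields consecutive, non-overlapping blocks of the requested size.
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    count = len(availability_zones)
    public = [str(next(blocks)) for _ in range(count)]
    private = [str(next(blocks)) for _ in range(count)]
    return {
        az: {"public": public[i], "private": private[i]}
        for i, az in enumerate(availability_zones)
    }


plan = plan_subnets("10.0.0.0/16", ["us-east-1a", "us-east-1b", "us-east-1c"])
for az, subnets in plan.items():
    print(az, subnets)
```

With these assumptions the three AZs use six of the sixteen available /20 blocks, which leaves room for additional subnet tiers later.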

ECS Module

Three services with different resource profiles:

| Service | CPU | Memory | Count | Purpose |
|---|---|---|---|---|
| Trading Core | 4 vCPU | 8 GB | 1 | Arbitrage detection |
| Telegram Bot | 0.5 vCPU | 1 GB | 1 | User interface |
| Web API | 1 vCPU | 2 GB | 2 | REST/gRPC access |

Trading Core gets the largest CPU and memory allocation because it runs the hot loop:

resource "aws_ecs_task_definition" "trading_core" {
  family                   = "${local.name_prefix}-trading-core"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = 4096 # 4 vCPU
  memory                   = 8192 # 8 GB

  container_definitions = jsonencode([{
    name  = "trading-core"
    image = var.trading_core_image

    secrets = [
      { name = "POLY_PRIVATE_KEY", valueFrom = "..." },
      { name = "KALSHI_PRIVATE_KEY", valueFrom = "..." }
    ]
  }])
}
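
Fargate expresses CPU in units of 1024 per vCPU and only accepts certain memory ranges for each CPU size. The Python sketch below converts the service table into those units and checks each row against the ranges for the three sizes used here. The range values reflect the Fargate task size table as commonly documented and should be checked against current AWS documentation; the `telegram-bot` and `web-api` keys are hypothetical names chosen for this example.

```python
# Allowed memory range in MiB for each Fargate CPU size used in this post.
# Assumption: values taken from the published Fargate task size table.
FARGATE_MEMORY_RANGE_MIB = {
    512: (1024, 4096),    # 0.5 vCPU: 1-4 GB
    1024: (2048, 8192),   # 1 vCPU: 2-8 GB
    4096: (8192, 30720),  # 4 vCPU: 8-30 GB
}

SERVICES = {
    "trading-core": {"vcpu": 4, "memory_gb": 8, "count": 1},
    "telegram-bot": {"vcpu": 0.5, "memory_gb": 1, "count": 1},
    "web-api": {"vcpu": 1, "memory_gb": 2, "count": 2},
}


def to_task_size(vcpu, memory_gb):
    """Convert vCPU/GB into the cpu/memory values a task definition expects."""
    cpu_units = int(vcpu * 1024)
    memory_mib = int(memory_gb * 1024)
    low, high = FARGATE_MEMORY_RANGE_MIB[cpu_units]
    if not low <= memory_mib <= high:
        raise ValueError(f"{memory_mib} MiB is not valid for {cpu_units} CPU units")
    return cpu_units, memory_mib


sizes = {name: to_task_size(s["vcpu"], s["memory_gb"]) for name, s in SERVICES.items()}
total_vcpu = sum(s["vcpu"] * s["count"] for s in SERVICES.values())
total_memory_gb = sum(s["memory_gb"] * s["count"] for s in SERVICES.values())

print(sizes)
print(total_vcpu, "vCPU,", total_memory_gb, "GB across all tasks")
```

Under these assumptions every service sits at the minimum memory allowed for its CPU size, so memory can be raised later without changing the CPU allocation.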

Secrets Management

Credentials are stored in AWS Secrets Manager with KMS encryption:

resource "aws_kms_key" "secrets" {
  description             = "KMS key for secrets encryption"
  deletion_window_in_days = 30
  enable_key_rotation     = true
}

resource "aws_secretsmanager_secret" "exchange_credentials" {
  name       = "${local.name_prefix}/exchange-credentials"
  kms_key_id = aws_kms_key.secrets.arn
}

ECS reads these secrets through the task execution role's IAM permissions and injects them into the container as environment variables at startup. Secrets never touch disk.

Database: Aurora Serverless v2

Auto-scaling PostgreSQL for variable workloads:

resource "aws_rds_cluster" "main" {
  cluster_identifier = "${local.name_prefix}-postgres"
  engine             = "aurora-postgresql"
  engine_mode        = "provisioned" # Serverless v2 uses the provisioned engine mode
  engine_version     = "15.4"
  database_name      = "arbiter"

  serverlessv2_scaling_configuration {
    min_capacity = 0.5 # Floor when idle (0.5 ACU, not zero)
    max_capacity = 16  # Scale up under load
  }
}

Serverless v2 scales automatically based on load, reducing costs during low-activity periods.
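
To make the scaling range concrete, the Python sketch below converts the configured capacity bounds into approximate memory and monthly ACU-hours. It assumes roughly 2 GiB of memory per ACU and a 730-hour month; no price is included, since billing rates vary by region and are not stated in this post.

```python
GIB_PER_ACU = 2        # assumption: one ACU is roughly 2 GiB of memory
HOURS_PER_MONTH = 730  # assumption: average month length in hours


def memory_range_gib(min_acu, max_acu):
    """Approximate memory available at the configured floor and ceiling."""
    return min_acu * GIB_PER_ACU, max_acu * GIB_PER_ACU


def acu_hours_per_month(average_acu, hours=HOURS_PER_MONTH):
    """Billable ACU-hours if the cluster averaged `average_acu` all month."""
    return average_acu * hours


floor_gib, ceiling_gib = memory_range_gib(0.5, 16)
idle_acu_hours = acu_hours_per_month(0.5)
peak_acu_hours = acu_hours_per_month(16)

print(f"memory: {floor_gib} to {ceiling_gib} GiB")
print(f"ACU-hours per month: {idle_acu_hours} idle, {peak_acu_hours} at ceiling")
```

Because billing is proportional to ACU-hours, a month spent at the floor costs 1/32 of a month spent at the ceiling, but it is never free with a 0.5 ACU minimum.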

GitHub Actions CI/CD

Two workflows handle CI and deployment:

CI Workflow (ci.yml)

Runs on every push:

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo fmt --check
      - run: cargo clippy -- -D warnings

  test:
    runs-on: ubuntu-latest
    steps:
      # Each job starts on a fresh runner, so it needs its own checkout
      - uses: actions/checkout@v4
      - run: cargo test --all-features

  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo build --release

  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cargo audit

Deploy Workflow (deploy.yml)

Triggered by version tags:

on:
  push:
    tags: ['v*']

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4

      # ECR_REPO, TAG, and CLUSTER are assumed to be defined elsewhere in the
      # workflow, along with AWS credentials and registry login.
      - name: Build and push images
        run: |
          docker build -t $ECR_REPO:$TAG ./arbiter-engine
          docker push $ECR_REPO:$TAG

      - name: Deploy infrastructure
        run: |
          cd infrastructure/terraform
          terraform init
          terraform apply -auto-approve

      - name: Update ECS services
        run: |
          aws ecs update-service --cluster $CLUSTER --service trading-core --force-new-deployment
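
The `v*` tag filter is broader than it may look. The Python sketch below approximates it with a regular expression, assuming the GitHub Actions rule that `*` matches any run of characters except `/`. The `tag_triggers_deploy` helper is hypothetical and only models the filter; it is not how GitHub evaluates it.

```python
import re


def tag_triggers_deploy(tag):
    """Approximate the 'v*' filter: a literal 'v', then any characters except '/'."""
    return re.fullmatch(r"v[^/]*", tag) is not None


for tag in ["v1.2.3", "v2", "vnext", "release-1.0", "v1/beta"]:
    print(tag, tag_triggers_deploy(tag))
```

Under this model a tag such as `vnext` would also start a production deploy, so a stricter version-shaped pattern may be worth considering.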

Security Considerations

| Layer | Protection |
|---|---|
| Network | Private subnets, security groups |
| Secrets | KMS encryption, IAM policies |
| Database | RLS, encrypted at rest |
| Container | ECR image scanning |
| API | JWT authentication, rate limiting |

Defense in depth: even if one layer is compromised, others provide protection.

Cost Optimization

| Component | Strategy |
|---|---|
| ECS | Fargate Spot for non-critical services |
| Aurora | Serverless v2 scales down to 0.5 ACU when idle |
| NAT Gateway | Single NAT for dev environments |
| Secrets | Rotation reduces breach window |

Production uses dedicated NAT gateways per AZ for high availability.

Verification

# Validate Terraform configuration
terraform validate

# Plan changes
terraform plan -out=tfplan

# Apply infrastructure
terraform apply tfplan

# Verify services are running
aws ecs describe-services --cluster arbiter-prod-cluster

Lessons Learned

  1. Modularize everything - Reusable modules simplify multi-environment setups
  2. Secrets rotation - Build in rotation from day one
  3. Serverless v2 - Aurora capacity follows load, so low-activity periods cost less
  4. Service discovery - ECS Cloud Map simplifies internal communication
  5. Tag-based deploys - Version tags make rollback straightforward

The infrastructure supports the application's needs while remaining maintainable and cost-effective.

CI/CD for a Docs Site: ADR-004

· 4 min read
Amiable Dev
Project Contributors

How we built a deployment pipeline that stays fresh without manual intervention.

The Problem

We needed a CI/CD pipeline that could:

  1. Deploy on merge to main
  2. Aggregate docs from upstream repos daily
  3. Allow manual rebuilds with cache bypass
  4. Run security scanning without slowing deploys

Why GitHub Pages?

We considered three options:

| Platform | Cost | PR Previews | HTTPS | Vendor Count |
|---|---|---|---|---|
| GitHub Pages | Free | No | Auto (*.github.io) | 1 |
| Netlify/Vercel | Free tier | Yes | Auto | 2 |
| Railway | ~$5/mo | Yes | Auto | 2 |

Cost wasn't the deciding factor: every option is free or close to it. What mattered:

  1. Vendor consolidation - secrets, permissions, and logs in one place
  2. No external OAuth - a smaller attack surface
  3. Workflow simplicity - the deploy-pages action needs no extra setup

The trade-off: No PR preview deployments. We accepted this because our site is documentation—reviewing markdown diffs is sufficient. For a React app with visual changes, we'd choose differently.

Note: Custom domains need DNS configuration and propagation time. The *.github.io subdomain gets HTTPS immediately.

The Pipeline

Key insight: security.yml runs in parallel with deploy.yml. A linting failure doesn't block deployment—but it does show up as a failed check on the commit.

Three triggers, one pipeline:

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 6 * * *' # Daily at 6 AM UTC
  workflow_dispatch:
    inputs:
      force_refresh:
        type: boolean
        default: false

Caching Strategy

Template aggregation fetches docs from GitHub repos. Without caching, every build would re-fetch everything.

Our approach:

  1. Cache key includes hashFiles('templates.yaml') - config changes invalidate
  2. Restore keys allow partial cache hits
  3. Manifest tracking in aggregation script compares commit SHAs

- name: Restore template cache
  if: ${{ github.event.inputs.force_refresh != 'true' }}
  uses: actions/cache@v5
  with:
    path: .cache/templates
    key: templates-${{ hashFiles('templates.yaml') }}-${{ github.run_id }}
    restore-keys: |
      templates-${{ hashFiles('templates.yaml') }}-
      templates-
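
The interaction between the key and the restore keys can be modeled in a few lines. The Python sketch below is a simplified model of the lookup: an exact key match wins, otherwise each restore key is tried in order as a prefix and the most recently created match is used. It ignores branch scoping and eviction, and the cache keys and timestamps are invented for illustration.

```python
def pick_cache(key, restore_keys, caches):
    """Return the cache key that would be restored, or None on a full miss.

    caches is a list of (cache_key, created_at) tuples.
    """
    if any(cache_key == key for cache_key, _ in caches):
        return key
    for prefix in restore_keys:
        matches = [(created, cache_key) for cache_key, created in caches
                   if cache_key.startswith(prefix)]
        if matches:
            # Newest cache sharing this prefix wins.
            return max(matches)[1]
    return None


# Hypothetical caches left behind by earlier runs.
caches = [
    ("templates-abc123-100", 1),
    ("templates-abc123-101", 2),
    ("templates-old999-90", 0),
]

# Same templates.yaml, new run: the run_id suffix never matches exactly.
same_config = pick_cache(
    "templates-abc123-102", ["templates-abc123-", "templates-"], caches)

# templates.yaml changed: only the broad "templates-" prefix matches.
changed_config = pick_cache(
    "templates-def456-103", ["templates-def456-", "templates-"], caches)

print(same_config, changed_config)
```

Because the key ends in the run ID, every run misses the exact key and falls back to a prefix, then saves a fresh cache under its own key. In this model a changed `templates.yaml` still restores an older cache through the broad `templates-` prefix, so it is the manifest comparison that decides what actually gets re-fetched.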

The force refresh option clears the cache entirely:

- name: Clear cache (if force refresh)
  if: ${{ github.event.inputs.force_refresh == 'true' }}
  run: rm -rf .cache/templates
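
The aggregation script itself is not shown in this post. As a minimal sketch of how manifest tracking and force refresh could fit together, the Python function below compares recorded commit SHAs with the current upstream SHAs and returns the repos that need fetching. The function name, the manifest shape, and the repo names are all hypothetical.

```python
def repos_to_fetch(manifest, upstream_shas, force_refresh=False):
    """Return the repos whose docs must be fetched again.

    manifest: repo -> commit SHA recorded at the last successful fetch
    upstream_shas: repo -> current commit SHA upstream
    """
    if force_refresh:
        # The cache was cleared, so everything is fetched again.
        return sorted(upstream_shas)
    return sorted(
        repo for repo, sha in upstream_shas.items()
        if manifest.get(repo) != sha
    )


manifest = {"org/a": "111", "org/b": "222"}
upstream = {"org/a": "111", "org/b": "333", "org/c": "444"}

changed = repos_to_fetch(manifest, upstream)
everything = repos_to_fetch(manifest, upstream, force_refresh=True)
print(changed, everything)
```

In this example only the repo with a new SHA and the repo missing from the manifest are fetched on a normal run, while a force refresh fetches all three.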

Security Scanning

Separate workflow, parallel execution:

# security.yml
jobs:
  gitleaks:
    # Secret scanning on every push

  dependency-review:
    # License and vulnerability check on PRs

  yaml-lint:
    # Configuration validation

This keeps security checks from blocking deploys while still catching issues.

The yamllint War Story

Our first security run failed spectacularly:

##[error]mkdocs.yml:88:5 [indentation] wrong indentation: expected 6 but found 4
##[error]templates.yaml:45:121 [line-length] line too long (156 > 120 characters)
##[warning].github/workflows/deploy.yml:3:1 [truthy] truthy value should be one of [false, true]

The investigation revealed three conflicts:

  1. on: is not a boolean - GitHub Actions uses on: as a keyword, but yamllint sees it as a truthy value
  2. MkDocs doesn't require --- - yamllint's document-start rule expects it
  3. Description fields are long - template descriptions exceed 120 characters

The fix: .yamllint.yml configuration that respects ecosystem conventions:

rules:
  # GitHub Actions uses `on:` as a keyword
  truthy:
    allowed-values: ['true', 'false', 'on']

  # MkDocs files don't need document start
  document-start: disable

  # Allow longer lines for descriptions
  line-length:
    max: 200

Lesson: Linting tools need per-ecosystem configuration. Default rules assume vanilla YAML.
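
The truthy warning comes from YAML 1.1, where words such as `on`, `yes`, and `off` are boolean spellings. The Python sketch below is a simplified model of that rule, not yamllint's actual code: it flags any boolean-looking token that is not in the allowed list. The token set is an assumption based on the YAML 1.1 boolean spellings.

```python
# Assumption: boolean-looking spellings from YAML 1.1 that the rule inspects.
YAML_1_1_TRUTHY = {
    "YES", "Yes", "yes", "NO", "No", "no",
    "TRUE", "True", "true", "FALSE", "False", "false",
    "ON", "On", "on", "OFF", "Off", "off",
}


def truthy_warnings(tokens, allowed_values=("true", "false")):
    """Return the tokens a truthy-style rule would flag."""
    allowed = set(allowed_values)
    return [t for t in tokens if t in YAML_1_1_TRUTHY and t not in allowed]


tokens = ["on", "true", "yes", "main"]

default_rule = truthy_warnings(tokens)
relaxed_rule = truthy_warnings(tokens, allowed_values=("true", "false", "on"))
print(default_rule, relaxed_rule)
```

Adding `on` to the allowed values silences the workflow keyword while still flagging other ambiguous spellings such as `yes`.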

Build Times

| Scenario | Time |
|---|---|
| Cold build (no cache) | ~45s |
| Warm build (cached) | ~20s |
| Force refresh | ~45s |

Most deploys hit the cache. Daily scheduled builds may be slower if upstream repos changed.

What We Learned

  1. Separate security from deploy - don't let linting failures block urgent content fixes
  2. Cache aggressively, invalidate precisely - manifest-based tracking beats time-based expiry
  3. Make force refresh easy - when caching goes wrong, you need an escape hatch

What's Next

  • ADR-005: DevSecOps implementation (the security.yml details)

Links: