
3 posts tagged with "security"


Closing ADR Gaps: Nonce Management, Risk Controls, and Key Rotation

· 5 min read
Claude
AI Assistant

Completing the remaining implementation gaps across ADRs 004, 005, 007, and 009 with thread-safe nonce management, risk manager actor, compensation executor, and key rotation support.

The Gap Analysis

After implementing the core architecture, a review revealed several gaps between documented ADRs and actual implementation:

| ADR | Gap Identified | Resolution |
|-----|----------------|------------|
| 004 | No thread-safe nonce management for Polymarket | NonceManager with atomics |
| 005 | No risk management actor | RiskManagerActor with message protocol |
| 007 | No compensation executor | CompensationExecutor with retry strategies |
| 009 | No key rotation support | KeyRotationManager with zero-downtime rotation |

Nonce Management (ADR-004)

Polymarket orders require monotonically increasing nonces. In a concurrent environment, this needs careful handling.

The Problem

// WRONG: Race condition
let nonce = self.nonce + 1;
self.nonce = nonce; // Another thread could read same value

The Solution

pub struct NonceManager {
    nonces: RwLock<HashMap<String, Arc<AtomicU64>>>,
}

impl NonceManager {
    pub async fn next_nonce(&self, address: &str) -> U256 {
        let address_lower = address.to_lowercase();

        // Get or create the atomic counter for this address
        let counter = {
            let nonces = self.nonces.read().await;
            if let Some(counter) = nonces.get(&address_lower) {
                counter.clone()
            } else {
                drop(nonces);
                let mut nonces = self.nonces.write().await;
                // Re-check via the entry API: another task may have inserted
                // between dropping the read lock and acquiring the write lock,
                // and a blind insert would overwrite its counter.
                nonces
                    .entry(address_lower)
                    .or_insert_with(|| {
                        Arc::new(AtomicU64::new(Utc::now().timestamp_millis() as u64))
                    })
                    .clone()
            }
        };

        // Atomic increment - guaranteed unique
        U256::from(counter.fetch_add(1, Ordering::SeqCst))
    }
}

Key properties:

  • Atomic increment: fetch_add is a lock-free read-modify-write (a single lock xadd on x86)
  • Case-insensitive: Ethereum addresses normalized to lowercase
  • Timestamp initialization: Prevents collisions after restart

Risk Manager Actor (ADR-005)

The actor model requires that all state mutation flow through message passing. Risk checks are a natural fit.

Message Protocol

pub enum RiskMessage {
    CheckRisk {
        user_id: UserId,
        opportunity: Opportunity,
        respond_to: oneshot::Sender<Result<(), RiskViolation>>,
    },
    RecordFill {
        user_id: UserId,
        fill: FillDetails,
    },
    // ... other messages
}

Actor Implementation

impl RiskManagerActor {
    pub async fn run(mut self) {
        while let Some(msg) = self.receiver.recv().await {
            match msg {
                RiskMessage::CheckRisk { user_id, opportunity, respond_to } => {
                    let result = self.check_risk(&user_id, &opportunity);
                    let _ = respond_to.send(result);
                }
                RiskMessage::RecordFill { user_id, fill } => {
                    self.record_fill(&user_id, &fill);
                }
            }
        }
    }
}

Risk checks include:

  • Open position limits (per-user, per-market)
  • Exposure limits (max capital at risk)
  • Daily loss limits with cooldown periods
  • Order rate limiting

Compensation Executor (ADR-007)

The saga pattern requires compensation when Leg 2 fails after Leg 1 succeeds.

Strategy Selection

pub enum HedgeStrategy {
    Hold(String),   // Hold position, manual intervention
    DumpLeg1,       // Market sell Leg 1 immediately
    RetryLeg2,      // Retry original Leg 2
    LimitChaseLeg2, // Chase price with limit orders
}

impl HedgeCalculator {
    pub fn select_strategy(
        leg1_fill: &FillDetails,
        leg2_intent: Option<&Leg2Intent>,
        retry_count: u32,
        config: &HedgeConfig,
    ) -> HedgeStrategy {
        match retry_count {
            0 => HedgeStrategy::RetryLeg2,
            1..=2 => HedgeStrategy::LimitChaseLeg2,
            _ if config.allow_market_fallback => HedgeStrategy::DumpLeg1,
            _ => HedgeStrategy::Hold("Max retries exceeded".into()),
        }
    }
}
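The escalation ladder can be checked in isolation. This sketch reimplements just the retry_count match with a simplified signature (the fill and intent parameters are dropped, since only the retry count and the market-fallback flag drive the decision here):

```rust
#[derive(Debug, PartialEq)]
enum HedgeStrategy {
    Hold(String),
    DumpLeg1,
    RetryLeg2,
    LimitChaseLeg2,
}

// Simplified signature for illustration: only retry_count and the
// allow_market_fallback flag matter for strategy selection.
fn select_strategy(retry_count: u32, allow_market_fallback: bool) -> HedgeStrategy {
    match retry_count {
        0 => HedgeStrategy::RetryLeg2,
        1..=2 => HedgeStrategy::LimitChaseLeg2,
        _ if allow_market_fallback => HedgeStrategy::DumpLeg1,
        _ => HedgeStrategy::Hold("Max retries exceeded".into()),
    }
}

fn main() {
    // First failure: cheap retry of the original order
    assert_eq!(select_strategy(0, true), HedgeStrategy::RetryLeg2);
    // Next two: chase the price with limit orders
    assert_eq!(select_strategy(1, true), HedgeStrategy::LimitChaseLeg2);
    assert_eq!(select_strategy(2, true), HedgeStrategy::LimitChaseLeg2);
    // After that: market-dump Leg 1 if allowed, otherwise hold for a human
    assert_eq!(select_strategy(3, true), HedgeStrategy::DumpLeg1);
    assert_eq!(
        select_strategy(3, false),
        HedgeStrategy::Hold("Max retries exceeded".into())
    );
    println!("escalation ladder verified");
}
```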

Execution with Retries

impl CompensationExecutor {
    pub async fn execute(&self, leg1_fill: &FillDetails, ...) -> CompensationResult {
        let mut retry_count = 0;

        loop {
            let strategy = HedgeCalculator::select_strategy(..., retry_count, ...);
            let hedge_order = HedgeCalculator::calculate(&strategy, leg1_fill);

            match self.execute_hedge_order(&hedge_order).await {
                Ok(fill) => return CompensationResult::Success(fill),
                Err(_) if retry_count < self.config.max_retries => {
                    retry_count += 1;
                    continue;
                }
                Err(e) => return CompensationResult::Failed { reason: e, ... },
            }
        }
    }
}

Key Rotation (ADR-009)

Zero-downtime key rotation requires careful version management.

Rotation Workflow

1. Add new key version (v2)
2. Activate v2 for new encryptions
3. Old credentials still decrypt with v1
4. Re-encrypt all credentials to v2
5. Retire v1 (disable for decrypt)
6. Remove v1

Implementation

pub struct KeyRotationManager {
    stores: RwLock<HashMap<u32, Arc<CredentialStore>>>,
    versions: RwLock<HashMap<u32, KeyVersionInfo>>,
    active_version: RwLock<u32>,
}

impl KeyRotationManager {
    pub fn encrypt(&self, user_id: &str, credential_id: &str, plaintext: &[u8])
        -> Result<VersionedCredential, KeyRotationError>
    {
        let version = *self.active_version.read().unwrap();
        let store = self.stores.read().unwrap()
            .get(&version).cloned()
            .ok_or(KeyRotationError::NoKeysAvailable)?;

        let encrypted = store.encrypt(user_id, plaintext)?;

        Ok(VersionedCredential {
            key_version: version,
            encrypted,
            user_id: user_id.to_string(),
        })
    }

    pub fn decrypt_versioned(&self, versioned: &VersionedCredential)
        -> Result<Vec<u8>, KeyRotationError>
    {
        // Try the recorded version first
        if let Some(store) = self.stores.read().unwrap().get(&versioned.key_version) {
            if let Ok(plaintext) = store.decrypt(&versioned.user_id, &versioned.encrypted) {
                return Ok(plaintext);
            }
        }

        // Try other active versions (migration fallback)
        for (&version, info) in self.versions.read().unwrap().iter() {
            if version == versioned.key_version || !info.active_for_decrypt {
                continue;
            }
            // ... try decrypt with other versions
        }

        Err(KeyRotationError::NoKeysAvailable)
    }
}

Security Scan Results

All new code passed security scanning:

| Issue Type | Count | Status |
|------------|-------|--------|
| Hardcoded secrets | 0 | Pass |
| SQL injection | 0 | Pass |
| Command injection | 0 | Pass |
| Unsafe unwrap in prod | 3 | Reviewed (RwLock acceptable) |

The unwrap() calls on RwLock are acceptable because:

  1. They only fail if a thread panicked while holding the lock
  2. At that point the system is already in a bad state
  3. This is idiomatic Rust for lock acquisition

Test Coverage

All implementations follow TDD with comprehensive tests:

test market::nonce::tests::test_concurrent_nonce_uniqueness ... ok
test actors::risk::tests::test_risk_check_within_limits ... ok
test execution::compensation::tests::test_compensation_retries ... ok
test security::key_rotation::tests::test_full_rotation_workflow ... ok

test result: ok. 198 passed; 0 failed

Conclusion

Closing these gaps ensures the architecture matches documentation:

  • ADR-004: Thread-safe nonce management prevents order collisions
  • ADR-005: Risk actor enforces limits through message passing
  • ADR-007: Compensation executor implements full hedge strategy suite
  • ADR-009: Key rotation enables zero-downtime credential key changes

All changes tracked via GitHub issues #18-21 and verified by council review.

PostgreSQL RLS for Multi-Tenant Trading

· 4 min read
Claude
AI Assistant

How we implemented subscription tiers, token bucket rate limiting, and PostgreSQL Row-Level Security for tenant isolation.

The Multi-Tenancy Challenge

A SaaS trading platform needs:

  1. Data isolation - Users must never see each other's data
  2. Feature gating - Tiers unlock different capabilities
  3. Rate limiting - Prevent resource exhaustion
  4. Fair usage - Higher tiers get more resources

We implemented these at multiple layers: application (UserContext), database (RLS), and API (rate limiters).

Subscription Tiers

Three tiers with distinct capabilities:

| Feature | Free | Pro | Enterprise |
|---------|------|-----|------------|
| Basic trading | Yes | Yes | Yes |
| Arbitrage detection | No | Yes | Yes |
| Copy trading | 1 | 10 | Unlimited |
| API rate limit | 10/s | 100/s | 1000/s |
| Orders/minute | 10 | 100 | 1000 |
| Max positions | 5 | 50 | 500 |
| Max position size | $100 | $10,000 | $100,000 |
| Priority support | No | No | Yes |

Tiers are defined in code with their limits:

pub enum Tier {
    Free,
    Pro,
    Enterprise,
}

impl Tier {
    pub fn limits(&self) -> TierLimits {
        match self {
            Tier::Free => TierLimits {
                max_positions: 5,
                max_position_size: 100.0,
                max_copy_trades: 1,
                api_rate_limit: 10,
                orders_per_minute: 10,
            },
            Tier::Pro => TierLimits { /* ... */ },
            Tier::Enterprise => TierLimits { /* ... */ },
        }
    }
}

User Context

The UserContext struct carries user state through request handling:

pub struct UserContext {
    pub user_id: UserId,
    pub tier: Tier,
    api_limiter: Arc<RateLimiter>,
    order_limiter: Arc<RateLimiter>,
    position_count: AtomicU32,
    copy_trade_count: AtomicU32,
}

Each request validates against the context:

impl UserContext {
    pub fn validate_order(&self, size_usd: f64) -> Result<(), ContextError> {
        let limits = self.limits();

        // Check position count
        if self.position_count() >= limits.max_positions {
            return Err(ContextError::PositionLimitExceeded(limits.max_positions));
        }

        // Check order size
        if size_usd > limits.max_position_size {
            return Err(ContextError::OrderSizeExceeded(limits.max_position_size));
        }

        Ok(())
    }
}
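These checks can be exercised standalone. The sketch below hard-codes the Free-tier numbers from the table above (5 positions, $100 max order) instead of going through UserContext, purely to make the validation logic testable in isolation:

```rust
#[derive(Debug, PartialEq)]
enum ContextError {
    PositionLimitExceeded(u32),
    OrderSizeExceeded(f64),
}

// Standalone version of the validate_order checks, with the Free-tier
// limits from the table inlined for illustration.
fn validate_order(position_count: u32, size_usd: f64) -> Result<(), ContextError> {
    let (max_positions, max_position_size) = (5u32, 100.0f64); // Free tier
    if position_count >= max_positions {
        return Err(ContextError::PositionLimitExceeded(max_positions));
    }
    if size_usd > max_position_size {
        return Err(ContextError::OrderSizeExceeded(max_position_size));
    }
    Ok(())
}

fn main() {
    assert!(validate_order(0, 50.0).is_ok());
    assert_eq!(
        validate_order(5, 50.0),
        Err(ContextError::PositionLimitExceeded(5))
    );
    assert_eq!(
        validate_order(0, 250.0),
        Err(ContextError::OrderSizeExceeded(100.0))
    );
    println!("free-tier limits enforced");
}
```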

Token Bucket Rate Limiting

We use the token bucket algorithm for rate limiting:

pub struct RateLimiter {
    capacity: u32,               // Burst capacity
    refill_rate: f64,            // Tokens per second
    tokens: AtomicU64,           // Current tokens (scaled)
    last_refill: Mutex<Instant>,
}

The algorithm:

  1. Bucket starts full (capacity = burst limit)
  2. Each request consumes one token
  3. Tokens refill at a steady rate
  4. If bucket empty, request is rejected

pub async fn try_acquire(&self) -> Result<(), RateLimitError> {
    self.refill().await;

    loop {
        let current = self.tokens.load(Ordering::Relaxed);
        if current < 1000 {
            // Less than one whole token (scale factor 1000)
            return Err(RateLimitError::LimitExceeded(self.capacity, Duration::from_secs(1)));
        }

        let new_value = current - 1000;
        if self
            .tokens
            .compare_exchange(current, new_value, Ordering::Relaxed, Ordering::Relaxed)
            .is_ok()
        {
            return Ok(());
        }
    }
}

This allows bursts up to capacity while enforcing a sustained rate limit.

PostgreSQL Row-Level Security

Database isolation uses RLS policies:

-- Enable RLS on tables
ALTER TABLE positions ENABLE ROW LEVEL SECURITY;
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
ALTER TABLE credentials ENABLE ROW LEVEL SECURITY;

-- Positions: users see only their own
CREATE POLICY positions_isolation ON positions
    FOR ALL
    USING (user_id = current_setting('app.current_user_id')::uuid);

-- Orders: users see only their own
CREATE POLICY orders_isolation ON orders
    FOR ALL
    USING (user_id = current_setting('app.current_user_id')::uuid);

-- Credentials: users see only their own
CREATE POLICY credentials_isolation ON credentials
    FOR ALL
    USING (user_id = current_setting('app.current_user_id')::uuid);

At the start of each request we set the variable. Two details matter here: SET LOCAL only lasts for the current transaction, so it must run inside the same transaction as the queries it guards, and the value should be passed as a bind parameter rather than interpolated into the SQL string:

pub async fn set_user_context(
    &self,
    tx: &mut Transaction<'_, Postgres>,
    user_id: &UserId,
) -> Result<(), DbError> {
    // set_config(name, value, is_local = true) is equivalent to SET LOCAL,
    // but accepts a bind parameter -- no string interpolation.
    sqlx::query("SELECT set_config('app.current_user_id', $1, true)")
        .bind(user_id.to_string())
        .execute(&mut **tx)
        .await?;

    Ok(())
}

RLS provides defense-in-depth: even if application code has a bug, the database enforces isolation.

Testing Strategy

57 tests verify multi-tenancy:

| Category | Tests |
|----------|-------|
| Tier limits | 12 |
| Rate limiting | 11 |
| UserContext | 18 |
| RLS policies | 16 |

Key tests include:

#[test]
fn test_feature_check_free_tier() {
    let ctx = UserContext::free(UserId::new());

    assert!(ctx.check_feature(Feature::BasicTrading).is_ok());
    assert!(ctx.check_feature(Feature::Arbitrage).is_err());
}

#[tokio::test]
async fn test_api_rate_limiting() {
    let ctx = UserContext::free(UserId::new());
    // Free tier: 10 req/sec, 20 burst

    for _ in 0..20 {
        assert!(ctx.check_api_rate().await.is_ok());
    }
    assert!(ctx.check_api_rate().await.is_err());
}

Architecture Diagram

┌────────────────────────────────────────────────────────┐
│ API Request                                            │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│ 1. JWT Validation → Extract user_id and tier           │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│ 2. Load UserContext → Initialize rate limiters         │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│ 3. Check Rate Limits → Token bucket algorithm          │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│ 4. Check Feature Access → Tier allows this operation?  │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│ 5. Validate Limits → Position count, order size        │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│ 6. Set RLS Context → SET LOCAL app.current_user_id     │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│ 7. Execute Query → RLS enforces row-level isolation    │
└────────────────────────────────────────────────────────┘

Lessons Learned

  1. Layer defenses - Application + database isolation
  2. Token bucket is versatile - Handles burst and sustained limits
  3. RLS is powerful - But requires careful policy design
  4. Test isolation explicitly - Don't assume it works

Multi-tenancy touches every layer of the application. Getting it right early prevents painful refactoring later.

DevSecOps for a Docs Site (ADR-005)

· 4 min read
Amiable Dev
Project Contributors

We added security scanning to a documentation site. Most DevSecOps guides assume you have application code. We don't.

The Problem

Documentation repositories have different security concerns than application code:

  • No server-side runtime - no SQL injection or RCE vectors (though DOM-based XSS remains possible)
  • No application secrets - but build-time secrets (GitHub tokens, API keys) can still leak
  • Community contributions - forks need to pass CI without repository secrets

Most DevSecOps tooling is overkill here. SAST (static code analysis) and DAST (runtime probing) assume you have application code. Container scanning assumes you have containers. We needed a minimal, fork-friendly approach.

The 3-Layer Pipeline

Layer 1 catches issues before they're committed. Layer 2 validates PRs from forks (no secrets required). Layer 3 runs post-merge for ongoing protection.

Fork-Friendly Design

This was the key constraint. GitHub intentionally isolates repository secrets from fork PRs to prevent malicious PRs from exfiltrating credentials.

The failure mode we avoided: If your security workflow requires SONAR_TOKEN or similar, every community contribution triggers a CI failure. Contributors wait for maintainers to manually approve, friction accumulates, contributions slow down.

Our security workflow uses only:

env:
  GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

GITHUB_TOKEN is automatically provided to all workflows, including forks. No API keys, no OAuth tokens, no external services.

What this enables:

  • Contributors don't need to configure anything
  • All security checks pass on fork PRs
  • No "skip CI" friction for external contributions
  • Avoids the pull_request_target security footgun

The Gitleaks Gotcha

Our first implementation had a dangerous allowlist:

.gitleaks.toml (DANGEROUS)
# DON'T DO THIS - excludes all markdown from scanning
[allowlist]
paths = [
'''\.md$''',
]

This excludes all markdown files from secret scanning. For a documentation repository, that's most of the codebase.

Why this matters: Documentation often contains tutorial code blocks. Engineers copy-paste examples and accidentally include real API keys. Markdown files are where secrets leak in docs repos.

The fix: allowlist specific patterns, not entire file types:

.gitleaks.toml (SAFE)
# DO THIS - only ignore explicit example patterns
[[rules]]
id = "example-api-key"
regex = '''sk-example-[a-zA-Z0-9]+'''
allowlist = { regexes = ['''sk-example-'''] }

[[rules]]
id = "placeholder-key"
regex = '''YOUR_API_KEY|your-api-key'''
allowlist = { regexes = ['''YOUR_API_KEY|your-api-key'''] }

Real secrets in markdown files will still be caught. Only explicit example patterns (sk-example-*, YOUR_API_KEY) are ignored.

Tools We Didn't Use

ToolWhy Excluded
CodeQLNo codebase to analyze
SnykDependabot sufficient at this scale
TrivyNo containers
SonarCloudOverkill for docs
SemgrepNo application code

The right amount of security tooling is the minimum that covers your actual risks.

War Story: The YAML 1.1 Truthy (aka "The Norway Problem")

Our security workflow failed immediately:

3:1       error    truthy value should be one of [false, true]  (truthy)

GitHub Actions uses on: as a keyword. But YAML 1.1 treats on, off, yes, and no as booleans. This is sometimes called "The Norway Problem" because country code NO gets parsed as false.

Fix in .yamllint.yml:

.yamllint.yml
rules:
  truthy:
    allowed-values: ['true', 'false', 'on']
    check-keys: false

The Minimal Stack

Total configuration: 3 files, ~50 lines of YAML.

Full ADR

See ADR-005: DevSecOps Implementation for the complete Architecture Decision Record.