
23 posts tagged with "python"


Eliminating Request Waterfalls: Parallel Data Fetching for SRE Dashboards

· 5 min read

Published: 2025-01-16


When an SRE responds to an incident, every second counts. Yet our dashboard was making them wait - not because the backend was slow, but because we'd accidentally created a request waterfall that serialized all our data loading. Here's how we fixed it.

The Problem

Our React application followed a common but flawed pattern: lazy load the component code, render it, then fetch data. This creates what's known as a "request waterfall":

User Authenticates
→ Component Code Loads (100ms)
→ Component Renders
→ useQuery fires (network latency ~200ms)
→ Data arrives
→ Re-render with data

For our SRE Dashboard, this meant loading 6 different data sources sequentially:

  1. Dashboard summary
  2. Health scores
  3. Top issues
  4. Applications list
  5. Active incidents
  6. SLO status

Each waited for the previous to complete. A 50ms backend response became a 300ms waterfall when chained.
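The arithmetic is easy to see in a toy simulation. The sketch below (illustrative latencies and source names, not our real endpoints) contrasts chaining six requests against firing them concurrently:

```python
import asyncio
import time

async def fetch(name: str, latency: float = 0.05) -> str:
    # Stand-in for a network call with a ~50ms backend response
    await asyncio.sleep(latency)
    return name

async def sequential(sources: list[str]) -> float:
    start = time.perf_counter()
    for s in sources:
        await fetch(s)  # each request waits for the previous one
    return time.perf_counter() - start

async def parallel(sources: list[str]) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(fetch(s) for s in sources))  # all fire at once
    return time.perf_counter() - start

sources = ["summary", "health", "issues", "apps", "incidents", "slo"]
seq = asyncio.run(sequential(sources))
par = asyncio.run(parallel(sources))
```

Sequential cost grows with the number of sources; parallel cost stays close to the slowest single request.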

The LLM Council review of ADR-036 (Lazy Loading) caught this flaw:

"The lazy loading strategy is fundamentally flawed. Code → Render → Data pattern degrades performance (sequential instead of parallel)."

The verdict was REJECTED. We needed to fix this.

The Solution

The fix is conceptually simple: fetch data in parallel with component code, not after.

User Authenticates
├── Component Code Loads (100ms)   ← PARALLEL
└── preloadCriticalData() fires    ← PARALLEL
    ├── /sre-dashboard
    ├── /health-score
    ├── /applications
    ├── /incidents
    ├── /slos
    └── /error-budgets
→ Component Renders with data already in cache

We implemented this with three key pieces:

1. Critical Path Manifest

First, we classified routes by urgency:

// frontend/src/utils/preload.ts
export const CRITICAL_ROUTES = {
  // Immediate - User likely to visit first
  IMMEDIATE: ['/sre-dashboard', '/incidents'],

  // Preload - User likely to visit soon after
  PRELOAD: ['/slos', '/slis', '/error-budgets', '/runbooks'],

  // Lazy - Only load when navigating
  LAZY: ['/settings', '/admin/*', '/analytics/*', '/reports/*'],
} as const;

This tells us what to preload aggressively vs. what can wait.

2. Shared Query Options

We created reusable query options that both the preloader and components use:

// frontend/src/routes/loaders/sreLoaders.ts
export const queryOptions = {
  sreDashboard: (applicationFilter: string = 'all') => ({
    queryKey: ['sre-dashboard', applicationFilter],
    queryFn: async () => {
      const params = new URLSearchParams({ application: applicationFilter });
      const response = await apiClient.get(`/sre-dashboard?${params}`);
      return response.data;
    },
    staleTime: 30000, // Consider fresh for 30 seconds
  }),

  healthScore: (applicationFilter: string = 'all') => ({
    queryKey: ['sre-health-score', applicationFilter],
    queryFn: async () => {
      const params = new URLSearchParams({ application: applicationFilter });
      const response = await apiClient.get(`/sre-dashboard/health-score?${params}`);
      return response.data;
    },
    staleTime: 30000,
  }),
  // ... more query options
};

The key insight: by sharing queryKey and queryFn between preloaders and components, React Query automatically deduplicates requests and shares the cache.
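That deduplication behavior can be sketched with a toy cache: the first fetch for a key does the work, and any later fetch with an equal key reuses it. This is a simplified stand-in for React Query's cache, not its actual implementation:

```python
from collections.abc import Callable

class QueryCache:
    """Toy cache-key deduplication (simplified stand-in for React Query)."""

    def __init__(self) -> None:
        self._cache: dict[tuple, object] = {}
        self.fetch_count = 0

    def fetch(self, key: tuple, fetcher: Callable[[], object]) -> object:
        if key not in self._cache:   # only the first caller hits the network
            self.fetch_count += 1
            self._cache[key] = fetcher()
        return self._cache[key]      # later callers share the cached result

cache = QueryCache()
key = ("sre-dashboard", "all")
preloaded = cache.fetch(key, lambda: {"status": "ok"})  # preloader fires first
rendered = cache.fetch(key, lambda: {"status": "ok"})   # component reuses the cache
```

Because preloader and component build the same key, only one fetch ever runs.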

3. Authentication-Triggered Preloading

We created a hook that fires preloading immediately after authentication:

// frontend/src/hooks/usePreloadCriticalData.ts
export function usePreloadCriticalData(options: { enabled?: boolean } = {}) {
  const queryClient = useQueryClient();
  const preloadedRef = useRef(false);

  useEffect(() => {
    if (!options.enabled || preloadedRef.current) return;
    preloadedRef.current = true;

    // Fire all preloads in parallel
    preloadCriticalData(queryClient);
  }, [options.enabled, queryClient]);
}

The preloadedRef ensures we only preload once per session, even if the user navigates between routes.

4. App Integration

Finally, we integrated the hook into App.tsx:

function App() {
  const { user } = useAuth();

  // Preload critical SRE data after authentication (Issue #452)
  usePreloadCriticalData({ enabled: !!user });

  // ... rest of app
}

Implementation Details

The preload function fires seven parallel requests using Promise.all:

export async function preloadCriticalData(queryClient: QueryClient): Promise<void> {
  console.log('[Preload] Starting critical data preload...');
  const startTime = performance.now();

  try {
    await Promise.all([
      queryClient.prefetchQuery(queryOptions.sreDashboard('all')),
      queryClient.prefetchQuery(queryOptions.healthScore('all')),
      queryClient.prefetchQuery(queryOptions.topIssues('all')),
      queryClient.prefetchQuery(queryOptions.applications()),
      queryClient.prefetchQuery(queryOptions.incidents('Active')),
      queryClient.prefetchQuery(queryOptions.slos()),
      queryClient.prefetchQuery(queryOptions.errorBudgets()),
    ]);

    const elapsed = performance.now() - startTime;
    console.log(`[Preload] Critical data loaded in ${elapsed.toFixed(0)}ms`);
  } catch (error) {
    // Don't throw - preloading is best-effort
    console.warn('[Preload] Failed to preload some critical data:', error);
  }
}

Note the error handling: preloading is best-effort. If some requests fail, the app still works - the component's useQuery will fetch the data normally.

Results

The performance improvement is significant:

| Metric | Before | After | Improvement |
|---|---|---|---|
| Dashboard initial load | ~800ms | ~300ms | 62% faster |
| Subsequent navigation | ~200ms | <50ms | Near-instant |
| Network requests | Sequential | Parallel | 6x concurrency |

More importantly, the user experience improved:

  • SRE Dashboard renders with data immediately after login
  • Navigation between critical routes feels instant
  • Cache remains valid for 30 seconds, reducing redundant requests

E2E Testing

We added comprehensive E2E tests to verify the parallel loading behavior:

test('should preload critical API data after authentication', async ({ page }) => {
  const apiRequests: string[] = [];

  page.on('request', (request) => {
    if (request.url().includes('/api/')) {
      apiRequests.push(request.url());
    }
  });

  // Login and wait for preloading
  await page.goto(`${BASE_URL}/login`);
  await page.fill('input[name="email"]', 'admin@example.com');
  await page.fill('input[name="password"]', 'admin123');
  await page.click('button:has-text("Sign In")');
  await page.waitForURL(`${BASE_URL}/sre-dashboard`);
  await page.waitForTimeout(2000);

  // Verify critical endpoints were called
  const criticalEndpoints = ['sre-dashboard', 'health-score', 'applications', 'incidents', 'slos'];
  const foundEndpoints = criticalEndpoints.filter((endpoint) =>
    apiRequests.some((url) => url.includes(endpoint))
  );

  expect(foundEndpoints.length).toBeGreaterThanOrEqual(4);
});

Lessons Learned

  1. Lazy loading isn't always the answer. Sometimes it introduces worse problems than it solves. The code → render → data waterfall is a classic trap.

  2. LLM Council reviews catch architectural issues. The REJECTED verdict on ADR-036 forced us to think harder about the performance implications.

  3. React Query's cache is powerful. By sharing query options between preloaders and components, we get automatic deduplication and cache sharing.

  4. Best-effort preloading is resilient. If preloading fails, the app still works. This makes the feature safe to deploy.

  5. Critical path thinking matters. Not all routes need instant loading. Categorizing by urgency lets us focus resources where they matter most.


This post details the implementation of the ADR-036 fix tracked in Issue #452.

From False Positives to Precision: Weighted Duplicate Scoring

· 5 min read

Published: 2025-01-16


Duplicate detection sounds simple: find things that are similar. But when your system uses multiple detection strategies, merging their results becomes the hard part. Our original implementation used a simple extend() pattern that created more problems than it solved. Here's how weighted scoring fixed it.

The Problem

Our duplicate detection system used three complementary strategies:

| Strategy | Strength | Weakness |
|---|---|---|
| Vector | Catches semantic similarity | Expensive (embedding generation) |
| Text | Catches near-exact wording | Misses paraphrases |
| Keyword | Fast topic matching | High false positive rate |

The original merge logic was straightforward:

def _merge_results(main_results, new_results, method):
    """Just extend the lists."""
    seen_ids = {item["id"] for items in main_results.values() for item in items}
    for level in ["high_confidence", "medium_confidence", "low_confidence"]:
        if level in new_results:
            for item in new_results[level]:
                if item["id"] not in seen_ids:
                    main_results[level].append(item)
                    seen_ids.add(item["id"])

The problem? This treats all strategies as an unconditional OR. If vector says 0.85 similarity and keyword says 0.60 similarity for the same item, which wins?

With extend(), both get added - or worse, only the first one encountered is kept. We lose the precision that comes from multiple strategies agreeing.

The LLM Council review caught this immediately:

"Simple extend treats all strategies as OR → 'High CPU on DB-01' and 'High CPU on DB-02' incorrectly merged as duplicates. Vector catches semantic similarity; fuzzy text catches similar hostnames → distinct incidents merged."

The verdict was REQUEST CHANGES. We needed weighted scoring.

The Solution

The fix required two conceptual shifts:

  1. Weighted aggregation instead of simple append
  2. Multi-strategy agreement as a confidence signal

Weighted Scoring Algorithm

We combine strategy scores with configurable weights:

@dataclass
class WeightedScoringConfig:
    vector_weight: float = 0.5    # Semantic similarity is most reliable
    text_weight: float = 0.3      # Text matching is secondary
    keyword_weight: float = 0.2   # Keyword is least precise
    min_strategies_for_high: int = 2

# Method on the detection service (excerpted)
def calculate_weighted_score(self, strategy_scores: dict) -> float:
    """
    Score = (Vector * 0.5) + (Text * 0.3) + (Keyword * 0.2)
    Normalizes if some strategies are missing.
    """
    weights = {
        "vector": self.weighted_config.vector_weight,
        "text": self.weighted_config.text_weight,
        "keyword": self.weighted_config.keyword_weight,
    }

    total_weight = 0.0
    weighted_sum = 0.0

    for strategy, score in strategy_scores.items():
        if strategy in weights:
            weight = weights[strategy]
            weighted_sum += score * weight
            total_weight += weight

    # Normalize if not all strategies were used
    if total_weight > 0:
        return weighted_sum / total_weight
    return 0.0

This means if only vector and text detect a match, we normalize over the weights that were actually used: (0.90 * 0.5 + 0.80 * 0.3) / (0.5 + 0.3) = 0.8625.
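A standalone sketch of that normalization (same weights, free function for illustration) makes the arithmetic checkable:

```python
WEIGHTS = {"vector": 0.5, "text": 0.3, "keyword": 0.2}

def calculate_weighted_score(strategy_scores: dict[str, float]) -> float:
    """Weighted average over only the strategies that actually fired."""
    weighted_sum = sum(score * WEIGHTS[s]
                       for s, score in strategy_scores.items() if s in WEIGHTS)
    total_weight = sum(WEIGHTS[s] for s in strategy_scores if s in WEIGHTS)
    return weighted_sum / total_weight if total_weight else 0.0

# Vector and text matched, keyword did not: normalize over 0.5 + 0.3
score = calculate_weighted_score({"vector": 0.90, "text": 0.80})
```

The result lands between the two raw scores, pulled toward the more heavily weighted vector strategy.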

Multi-Strategy Agreement

A single strategy match should never produce high confidence - too much risk of false positives:

# Method on the detection service (excerpted)
def determine_confidence_level(self, result, min_strategies_for_high=2):
    """
    High confidence requires multiple strategies to agree.
    Single-strategy matches can only be medium or low.
    """
    strategies_matched = result.get("strategies_matched", 0)
    weighted_score = self.calculate_weighted_score(result.get("scores", {}))

    # Multi-strategy agreement required for high confidence
    if strategies_matched >= min_strategies_for_high and weighted_score >= 0.80:
        return "high"
    elif strategies_matched >= 1 and weighted_score >= 0.65:
        return "medium"
    else:
        return "low"

This simple change dramatically reduces false positives. Even a 0.95 vector similarity can't produce "high" confidence alone.
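Restated as a standalone function (same thresholds as above), the guardrail is easy to verify:

```python
def confidence(strategies_matched: int, weighted_score: float,
               min_for_high: int = 2) -> str:
    """Standalone restatement of the confidence rule (illustrative sketch)."""
    if strategies_matched >= min_for_high and weighted_score >= 0.80:
        return "high"
    if strategies_matched >= 1 and weighted_score >= 0.65:
        return "medium"
    return "low"

solo_vector = confidence(strategies_matched=1, weighted_score=0.95)  # strong, but alone
agreement = confidence(strategies_matched=2, weighted_score=0.82)    # two strategies agree
```

A lone 0.95 caps out at "medium"; a more modest 0.82 backed by two strategies earns "high".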

Entity-Specific Thresholds

Different entity types need different sensitivity:

ENTITY_THRESHOLDS = {
    "incidents": EntityThresholds(
        high_confidence=0.95,  # Strict - don't suppress real alarms
        medium_confidence=0.85,
        low_confidence=0.75,
    ),
    "runbooks": EntityThresholds(
        high_confidence=0.80,  # Loose - find helpful content
        medium_confidence=0.70,
        low_confidence=0.60,
    ),
    "requirements": EntityThresholds(
        high_confidence=0.85,  # Default - balanced
        medium_confidence=0.75,
        low_confidence=0.65,
    ),
}

For incidents, we want almost-exact matches only (0.95). False negatives are better than suppressing real alarms. For runbooks, we're more permissive (0.80) because finding helpful related content is the goal.

Cascading Execution

We also optimized for performance by running cheap strategies first:

def get_strategy_execution_order() -> list[str]:
    return ["keyword", "text", "vector"]  # Cheapest first

Keyword matching is just string comparison - nearly free. Vector requires embedding generation or lookup - expensive. By running keyword first, we can potentially skip the expensive vector check if keyword already disqualifies the match.
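A sketch of that cascade, with illustrative relative costs and a simplified "no overlap at all" early exit (not our production disqualification rule):

```python
COSTS = {"keyword": 1, "text": 10, "vector": 100}  # illustrative relative costs

def detect(item, strategies):
    """Run strategies cheapest-first; stop early if a cheap one disqualifies."""
    spent = 0
    scores = {}
    for name, fn in strategies:
        spent += COSTS[name]
        score = fn(item)
        scores[name] = score
        if score == 0.0:  # no topical overlap at all: skip pricier checks
            break
    return scores, spent

strategies = [
    ("keyword", lambda item: 0.0),  # keyword finds no overlap
    ("text", lambda item: 0.9),     # never reached
    ("vector", lambda item: 0.9),   # never reached
]
scores, spent = detect("High CPU on DB-01", strategies)
```

When keyword disqualifies, we pay cost 1 instead of 111.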

Implementation

The updated _merge_results now tracks per-strategy scores:

# Method on the detection service (excerpted)
def _merge_results(self, main_results, new_results, method):
    """Merge with weighted scoring instead of simple extend."""

    # Initialize tracking dict
    if "_item_scores" not in main_results:
        main_results["_item_scores"] = {}

    for level in ["high_confidence", "medium_confidence", "low_confidence"]:
        for item in new_results.get(level, []):
            item_id = item["id"]

            # Track per-strategy scores for this item
            if item_id not in main_results["_item_scores"]:
                main_results["_item_scores"][item_id] = {
                    "item_data": item,
                    "strategy_scores": {},
                    "strategies_matched": 0,
                }

            # Add score for this strategy
            score = extract_score_by_method(item, method)
            main_results["_item_scores"][item_id]["strategy_scores"][method] = score
            main_results["_item_scores"][item_id]["strategies_matched"] += 1

    # Recategorize based on weighted scoring
    self._recategorize_with_weighted_scoring(main_results)

The key insight: we don't decide the confidence level when processing each strategy. We wait until all strategies have contributed their scores, then calculate the weighted result.

Results

The weighted scoring approach provides:

| Metric | Before | After |
|---|---|---|
| False Positive Rate | High (~30%) | Low (~5%) |
| Multi-strategy matches | Not tracked | Prioritized |
| Entity-specific tuning | None | Full support |
| Debugging info | Lost | Preserved |

Each result now includes full strategy breakdown:

{
  "id": "REQ-001",
  "weighted_score": 0.881,
  "confidence_level": "high",
  "strategies_matched": 2,
  "strategy_scores": {
    "vector": 0.90,
    "text": 0.85
  }
}

This makes debugging straightforward: you can see exactly which strategies contributed and with what scores.

Lessons Learned

  1. Simple OR logic loses precision - When merging results from multiple strategies, preserve the individual contributions.

  2. Agreement is a signal - Two strategies detecting the same match is much stronger evidence than one strategy with a high score.

  3. Different entities need different thresholds - What's "similar enough" for runbooks is too loose for incidents.

  4. Preserve debugging info - The per-strategy scores are invaluable for tuning thresholds and investigating false positives.

  5. Order matters for performance - Run cheap strategies first to enable early termination.


This post details the implementation of ADR-038: Duplicate Detection weighted scoring improvements.

Entity Relationship Primary Key Fix: Preventing Silent Data Corruption

· 3 min read

Published: 2025-01-16


When the LLM Council reviewed our entity relationships implementation (ADR-026), they flagged a critical flaw: our 4-column primary key blocked multiple relationship types between the same entities, and duplicate relationships could silently corrupt traceability data.

This post details how Issue #458 fixed both data integrity gaps with a proper 5-column primary key.

The Problem

Our EntityRelationship table connects any two entities with typed relationships:

-- Original schema (simplified)
CREATE TABLE entity_relationships (
source_type VARCHAR(50),
source_id VARCHAR(255),
target_type VARCHAR(50),
target_id VARCHAR(255),
relationship_type VARCHAR(50),
PRIMARY KEY (source_type, source_id, target_type, target_id)
-- Notice: relationship_type NOT in primary key!
);

The 4-column primary key created two problems:

Problem 1: Blocked Multiple Relationship Types

-- Insert #1: REQ-001 depends on REQ-002
INSERT INTO entity_relationships
VALUES ('requirement', 'REQ-001', 'requirement', 'REQ-002', 'depends_on');
-- SUCCESS!

-- Insert #2: REQ-001 also references REQ-002
INSERT INTO entity_relationships
VALUES ('requirement', 'REQ-001', 'requirement', 'REQ-002', 'references');
-- FAILS! PK violation (same 4-column key)

This was fundamentally broken—entities couldn't have multiple relationship types!

Problem 2: Potential Duplicate Relationships

Without proper constraints, race conditions or UI bugs could create duplicate relationships.

Consequences

  1. Feature Blockage: Can't express "depends_on AND references" relationships
  2. Graph Pollution: (if duplicates existed) Wrong edge counts
  3. IEEE 29148-2018 Non-Compliance: Traceability requires rich relationships

The Solution

Change the primary key to 5 columns, including relationship_type:

# backend/models/generic_relationships.py
class EntityRelationship(Base):
    __tablename__ = "entity_relationships"

    # 5-column composite primary key
    source_type = Column(String(50), primary_key=True)
    source_id = Column(String(255), primary_key=True)
    target_type = Column(String(50), primary_key=True)
    target_id = Column(String(255), primary_key=True)
    relationship_type = Column(String(50), primary_key=True)  # Now in PK!

Key Insight: Multiple Relationship Types Are Now Valid

With the 5-column PK:

  • REQ-001 → REQ-002 with depends_on
  • REQ-001 → REQ-002 with references ✓ (different relationship type = different row)
  • REQ-001 → REQ-002 with depends_on again ✗ (exact duplicate blocked by PK)
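The new uniqueness rule can be demonstrated with an in-memory SQLite table (the migration itself targets PostgreSQL; SQLite is used here only to exercise the composite key):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE entity_relationships (
        source_type TEXT, source_id TEXT,
        target_type TEXT, target_id TEXT,
        relationship_type TEXT,
        PRIMARY KEY (source_type, source_id, target_type, target_id, relationship_type)
    )
""")
row = ("requirement", "REQ-001", "requirement", "REQ-002")

# Two different relationship types between the same entities: both succeed
conn.execute("INSERT INTO entity_relationships VALUES (?, ?, ?, ?, ?)",
             row + ("depends_on",))
conn.execute("INSERT INTO entity_relationships VALUES (?, ?, ?, ?, ?)",
             row + ("references",))

# An exact duplicate 5-tuple is rejected by the primary key
try:
    conn.execute("INSERT INTO entity_relationships VALUES (?, ?, ?, ?, ?)",
                 row + ("depends_on",))
    duplicate_blocked = False
except sqlite3.IntegrityError:
    duplicate_blocked = True
```

Two rows survive; the duplicate never lands.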

Migration Strategy

The migration handles the schema change safely:

def upgrade():
    # Step 1: Remove any duplicates (keep newest by updated_at)
    conn = op.get_bind()
    conn.execute(text("""
        DELETE FROM entity_relationships
        WHERE ctid NOT IN (
            SELECT DISTINCT ON (source_type, source_id, target_type, target_id, relationship_type)
                ctid
            FROM entity_relationships
            ORDER BY source_type, source_id, target_type, target_id, relationship_type,
                     updated_at DESC NULLS LAST
        )
    """))

    # Step 2: Drop 4-column primary key
    op.drop_constraint("entity_relationships_pkey", "entity_relationships", type_="primary")

    # Step 3: Create 5-column primary key
    op.create_primary_key(
        "entity_relationships_pkey",
        "entity_relationships",
        ["source_type", "source_id", "target_type", "target_id", "relationship_type"],
    )

Testing

Our TDD tests verify the constraint at the database level:

def test_duplicate_relationship_rejected_at_database_level(self, postgres_db):
    """Exact duplicate 5-tuple MUST raise IntegrityError."""
    rel1 = EntityRelationship(
        source_type="requirement", source_id="REQ-DUP-001",
        target_type="capability", target_id="CAP-DUP-001",
        relationship_type="depends_on",
    )
    postgres_db.add(rel1)
    postgres_db.flush()

    # Attempt exact duplicate
    rel2_duplicate = EntityRelationship(
        source_type="requirement", source_id="REQ-DUP-001",
        target_type="capability", target_id="CAP-DUP-001",
        relationship_type="depends_on",  # Same!
    )
    postgres_db.add(rel2_duplicate)

    with pytest.raises(IntegrityError):
        postgres_db.flush()  # MUST fail

Impact

| Before | After |
|---|---|
| Duplicates silently accepted | IntegrityError on duplicate |
| Graph traversal counts edges incorrectly | Clean graph structure |
| UI shows duplicates | No duplicates possible |
| ADR-026: CONDITIONAL | ADR-026: APPROVED |

Lessons Learned

  1. Composite Primary Keys Need Review: Adding a column to a table doesn't mean it's part of the uniqueness guarantee
  2. Semantic Differences Matter: "depends_on" and "references" are different relationships—the fix preserves this distinction
  3. Database Constraints > Application Validation: Race conditions can bypass application checks; database constraints are authoritative
  4. Migration Must Handle Existing Data: Don't just add constraints—clean up legacy data first

Issue #458 | ADR-026 | LLM Council Blocking Issue Resolved

MCP Bidirectional Traffic: Fixing SSE Buffering and Rate Limits

· 4 min read

Published: 2025-01-16


When the LLM Council reviewed our MCP client proxy (ADR-040), they identified a critical gap: our nginx configuration was buffering SSE responses, causing tool execution to hang. Additionally, our standard API rate limits (60 req/min) were breaking MCP negotiation, which is inherently chatty.

This post details how Issue #460 fixed these SSE streaming issues.

The Problem

Problem 1: nginx Buffering Blocks SSE

Default nginx proxy configuration buffers responses:

# Default behavior (problematic for SSE)
location /api/ {
    proxy_pass http://backend;
    # proxy_buffering is ON by default!
}

When SSE events are buffered, they arrive in bursts instead of real-time. For MCP:

  • Tool execution appears to hang for seconds
  • Timeouts during long-running operations
  • Poor user experience in Claude Desktop

Problem 2: Rate Limits Break MCP Negotiation

MCP protocol is chatty during initialization:

  • Capabilities exchange
  • Tool listing
  • Prompt listing
  • Resource queries

Our standard 60 req/min limit triggered during normal MCP negotiation, causing connection failures.

The Solution

1. nginx SSE Location Block

Added dedicated location block for MCP SSE endpoints:

# MCP SSE endpoint - MUST be before generic /api/
location ~ ^/api/v1/mcp/(sse|message) {
    proxy_pass ${API_PROXY_URL};
    proxy_http_version 1.1;

    # Disable buffering for SSE (critical for real-time events)
    proxy_buffering off;
    proxy_cache off;
    proxy_set_header X-Accel-Buffering "no";

    # Extended timeouts for long-running SSE (4 hours)
    proxy_read_timeout 14400s;
    proxy_send_timeout 14400s;

    # Connection headers for SSE
    proxy_set_header Connection '';
    chunked_transfer_encoding on;
}

Key configurations:

  • proxy_buffering off: Events stream immediately
  • X-Accel-Buffering: no: Header for upstream servers
  • 14400s timeouts: 4-hour sessions for long operations
  • Connection '': Prevent connection header interference

2. Split Rate Limiting

Created separate rate limiters for SSE and messages:

# SSE: Connection-based limit (5 concurrent per user)
class MCPSSERateLimiter:
    def __init__(self, max_connections: int = 5):
        self.max_connections = max_connections

    async def acquire(self, user_id: str, connection_id: str) -> bool:
        """Acquire a connection slot."""
        # Uses Redis SET for atomic connection counting

# Messages: Token bucket (200 req/min, 50 burst)
class MCPMessageRateLimiter:
    def __init__(self, rate_limit: int = 200, burst_limit: int = 50):
        self.rate_limit = rate_limit
        self.burst_limit = burst_limit

    async def check(self, user_id: str) -> bool:
        """Check if message is allowed."""
        # Uses Redis token bucket algorithm

Why split limits?

| Endpoint | Limit Type | Value | Reason |
|---|---|---|---|
| SSE /sse | Concurrent | 5 per user | Long-lived connections, prevent resource exhaustion |
| POST /message/{id} | Token bucket | 200/min, 50 burst | Handle chatty negotiation, allow burst |
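The message limiter's token bucket is easy to sketch in memory (the production version keeps this state in Redis; the numbers mirror the 200 req/min rate and 50-token burst):

```python
import time

class TokenBucket:
    """In-memory token bucket sketch (real limiter stores state in Redis)."""

    def __init__(self, rate_per_min: float = 200, burst: int = 50):
        self.rate = rate_per_min / 60.0  # tokens added per second
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket()
burst_results = [bucket.allow() for _ in range(60)]  # chatty MCP negotiation
```

The first 50 requests pass immediately (the burst); further requests wait for the ~3.3 tokens/second refill.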

3. Extended Session TTL

MCP sessions now have 4-hour TTL to match nginx timeouts:

# backend/api/v1/mcp_sse.py
MCP_SESSION_TTL_SECONDS = 14400 # 4 hours
SSE_READ_TIMEOUT_SECONDS = 14400 # Matches nginx config

Implementation Details

Rate Limiter Storage

Uses Redis DB 4 (separate from API rate limiting DB 3):

MCP_RATE_LIMIT_DB = 4

# SSE: Uses Redis SET for connection tracking
key = f"mcp_sse:{user_id}:connections"
# SET contains active connection_ids

# Messages: Uses Redis HASH for token bucket
key = f"mcp_msg:{user_id}"
# HASH contains {tokens: N, last_refill: timestamp}

Rate Limiter Release

Crucial: Release SSE slot when connection closes:

async def remove_connection(self, connection_id: str):
    # ... disconnect logic ...

    # Release the SSE rate limit slot
    user_key = connection.user_info.email
    await sse_limiter.release(user_key, connection_id)

Without this, users would exhaust their connection limit and be unable to reconnect.
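One way to make the release hard to forget is a context manager that pairs acquire with release; an in-memory sketch (the real limiter is Redis-backed, and the names here are illustrative):

```python
import asyncio
from contextlib import asynccontextmanager

class SSEConnectionLimiter:
    """In-memory stand-in for the Redis-backed SSE connection limiter."""

    def __init__(self, max_connections: int = 5):
        self.max_connections = max_connections
        self._connections: dict[str, set[str]] = {}

    async def acquire(self, user_id: str, connection_id: str) -> bool:
        conns = self._connections.setdefault(user_id, set())
        if len(conns) >= self.max_connections:
            return False
        conns.add(connection_id)
        return True

    async def release(self, user_id: str, connection_id: str) -> None:
        self._connections.get(user_id, set()).discard(connection_id)

    @asynccontextmanager
    async def slot(self, user_id: str, connection_id: str):
        # try/finally guarantees the slot is released even if the stream errors
        if not await self.acquire(user_id, connection_id):
            raise RuntimeError("connection limit reached")
        try:
            yield
        finally:
            await self.release(user_id, connection_id)

async def demo() -> bool:
    limiter = SSEConnectionLimiter(max_connections=1)
    async with limiter.slot("user@example.com", "conn-1"):
        blocked = not await limiter.acquire("user@example.com", "conn-2")
    # Slot released on exit, so a new connection succeeds
    reconnected = await limiter.acquire("user@example.com", "conn-3")
    return blocked and reconnected

ok = asyncio.run(demo())
```

With the slot scoped to a context manager, a crashed stream can no longer leak a connection.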

Impact

| Metric | Before | After |
|---|---|---|
| SSE event latency | Buffered (seconds) | Real-time (<100ms) |
| MCP negotiation | Often rate limited | Reliable |
| Session duration | 30 minutes | 4 hours |
| Concurrent connections | No limit | 5 per user |
| ADR-040 verdict | CONDITIONAL | APPROVED |

Lessons Learned

  1. SSE needs special handling: Standard proxy configs don't work for SSE
  2. Different endpoints, different limits: API rate limits don't fit all protocols
  3. Match timeouts end-to-end: nginx, backend, and client must agree
  4. Resource cleanup matters: Release rate limit slots on disconnect

Issue #460 | ADR-040 | LLM Council Blocking Issue Resolved

DataGrid Pro Stability: Preventing Infinite Loops and Cascade Updates

· 4 min read

Published: 2025-01-16


When the LLM Council reviewed our DataGrid Pro implementation (ADR-035), they identified critical stability issues: missing getRowId props causing row identity recreation, unstable query keys triggering unnecessary refetches, and unthrottled filter/sort handlers causing cascade updates.

This post details how Issue #462 fixed these DataGrid stability issues.

The Problem

Problem 1: Missing getRowId Causes Row Identity Recreation

Without an explicit getRowId prop, DataGrid Pro uses array index as the row identifier:

// Before: Row identity based on array index
<DataGridPremium
  rows={data}
  columns={columns}
/>

When the data array is recreated (even with the same values), DataGrid treats all rows as new, causing:

  • Loss of selection state
  • Scroll position reset
  • Unnecessary DOM reconciliation
  • Potential infinite loops with controlled components

Problem 2: Unstable Query Keys

React Query triggers refetches when query key references change:

// Problematic: New object reference every render
const queryKey = ['entities', entityType, { page, filters }];

Combined with DataGrid's server-side pagination callbacks, this creates a feedback loop:

  1. DataGrid triggers onPageChange
  2. New query key object created
  3. React Query refetches
  4. New data causes DataGrid to re-render
  5. Goto step 1
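The identity-vs-equality trap underneath this loop is language-agnostic. A small Python sketch of the "keep the old object when the values are equal" idea behind useStableQueryKey:

```python
class StableKey:
    """Return the previous key object when a newly built key is equal to it
    (Python sketch of the deep-comparison idea, not the React hook itself)."""

    def __init__(self) -> None:
        self._current = None

    def stabilize(self, key):
        if self._current is None or self._current != key:
            self._current = key   # values changed: adopt the new object
        return self._current      # values equal: keep the old identity

stable = StableKey()
k1 = stable.stabilize(["entities", "requirement", {"page": 1}])
k2 = stable.stabilize(["entities", "requirement", {"page": 1}])  # equal, new object
k3 = stable.stabilize(["entities", "requirement", {"page": 2}])  # actually different
```

Because k1 and k2 are the same object, an identity-based consumer (like React Query's key check) sees no change and skips the refetch.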

Problem 3: Unthrottled Filter/Sort Handlers

DataGrid fires onFilterModelChange and onSortModelChange rapidly during user interaction:

// Before: Every keystroke triggers API call
const handleFilterChange = (model) => {
  setFilters(model); // Immediate state update
  refetch(); // Immediate API call
};

This causes:

  • Excessive API calls during typing
  • UI lag from rapid state updates
  • Server load from redundant requests

The Solution

1. Explicit getRowId Prop

Added getRowId to all DataGrid instances:

<DataGridPremium
  // ADR-035: Explicit getRowId prevents row identity recreation
  getRowId={(row) => row.id}
  rows={safeData}
  columns={columns}
/>

Why this works: The entity's actual ID (UUID or database ID) is used as the row key instead of array position. Row identity is stable across data updates.

2. Stable Query Key Hook

Created useStableQueryKey for deep comparison of query keys:

// frontend/src/hooks/useStableQueryKey.ts
import { useMemo, useRef } from 'react';
import isEqual from 'lodash/isEqual';

export function useStableQueryKey<T>(queryKey: T): T {
  const keyRef = useRef<T>(queryKey);

  return useMemo(() => {
    // Only update the reference if the values actually changed
    if (!isEqual(keyRef.current, queryKey)) {
      keyRef.current = queryKey;
    }
    return keyRef.current;
  }, [queryKey]);
}

Usage:

const queryKey = useStableQueryKey(['entities', entityType, filters]);
// queryKey reference only changes when values actually differ

3. Throttled Handlers

Created useThrottle hook for rate-limited callbacks:

// frontend/src/hooks/useThrottle.ts
export function useThrottle<T extends (...args: any[]) => void>(
  callback: T,
  delay: number,
  options: { leading?: boolean; trailing?: boolean } = {}
): T {
  // Throttle implementation with leading/trailing edge support
}

Applied to DataGrid:

const handleFilterChangeInternal = useCallback((model) => {
  setFilters(model);
  onFilterChange?.(model);
}, [onFilterChange]);

// ADR-035: Throttle to 300ms prevents cascade updates
const handleFilterChange = useThrottle(handleFilterChangeInternal, 300, {
  leading: true,
  trailing: true,
});

<DataGridPremium
  onFilterModelChange={handleFilterChange}
/>
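The throttling idea itself is framework-independent. A simplified synchronous Python sketch of leading/trailing-edge throttling (not the useThrottle implementation; the explicit flush stands in for the hook's trailing timer):

```python
import time

def throttle(delay: float, leading: bool = True, trailing: bool = True):
    """Decorator sketch: run the first call immediately (leading edge),
    remember the latest suppressed call for the trailing edge."""
    def wrap(fn):
        state = {"last": float("-inf"), "pending": None}

        def call(*args):
            now = time.monotonic()
            if now - state["last"] >= delay:
                state["last"] = now
                if leading:
                    fn(*args)   # leading edge fires immediately
                    return
            state["pending"] = args  # remember only the most recent call

        def flush():
            # trailing edge: run the most recent suppressed call
            if trailing and state["pending"] is not None:
                fn(*state["pending"])
                state["pending"] = None

        call.flush = flush
        return call
    return wrap

calls = []

@throttle(delay=0.3)
def apply_filter(model):
    calls.append(model)

for text in ["a", "ab", "abc", "abcd"]:  # rapid keystrokes
    apply_filter(text)
apply_filter.flush()
```

Four keystrokes collapse into two filter applications: the first character right away, the final value on the trailing edge.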

4. URL State Synchronization

Created useGridUrlState for shareable grid configurations:

// frontend/src/hooks/useGridUrlState.ts
export function useGridUrlState(defaults) {
  const [searchParams, setSearchParams] = useSearchParams();

  // Parse pagination, sort, search from URL (param names illustrative)
  const state = useMemo(() => ({
    pagination: {
      page: Number(searchParams.get('page') ?? defaults.page),
      pageSize: Number(searchParams.get('pageSize') ?? defaults.pageSize),
    },
    sort: [{ field: searchParams.get('sortField'), sort: searchParams.get('sortDir') }],
    search: searchParams.get('search') ?? '',
  }), [searchParams, defaults]);

  // Update URL without navigation
  const setPagination = (model) => {
    setSearchParams((prev) => {
      prev.set('page', String(model.page));
      return prev;
    }, { replace: true });
  };

  // setSort, setSearch, getApiParams follow the same pattern
  return { state, setPagination, setSort, setSearch, getApiParams };
}

Benefits:

  • Grid state preserved across page refreshes
  • Shareable URLs with filters/pagination
  • Browser back/forward navigation support

Implementation

Files Created

  • frontend/src/hooks/useStableQueryKey.ts
  • frontend/src/hooks/useThrottle.ts
  • frontend/src/hooks/useGridUrlState.ts
  • frontend/e2e/datagrid-stability.spec.ts
  • frontend/src/hooks/__tests__/useStableQueryKey.test.ts
  • frontend/src/hooks/__tests__/useThrottle.test.ts

Files Modified

  • frontend/src/components/tables/MUIEntityTable.tsx

    • Added getRowId={(row) => row.id}
    • Added throttled filter/sort handlers
    • Imported and applied useThrottle
  • frontend/src/components/DataGridWrapper.tsx

    • Added getRowId={(row) => row.id}
    • Added memoized rows with useMemo
    • Added throttled handlers

Impact

| Metric | Before | After |
|---|---|---|
| Row identity stability | Index-based | ID-based |
| Filter change API calls | 1 per keystroke | Max 3 per second |
| Query key stability | New ref each render | Stable until values change |
| URL state persistence | None | Full pagination/sort/search |
| ADR-035 verdict | CONDITIONAL | APPROVED |

Key Takeaways

  1. Always specify getRowId: DataGrid needs stable row identity for controlled components
  2. Stabilize query keys: Use deep comparison to prevent unnecessary refetches
  3. Throttle user interactions: Rate-limit handlers that trigger API calls
  4. URL state enables sharing: Sync grid state to URL for better UX

Issue #462 | ADR-035 | LLM Council Blocking Issue Resolved

Building a Trading Bot Interface with Telegram and gRPC

· 6 min read
Claude
AI Assistant

This post details the implementation of the Arbiter Telegram bot - a Python service that provides mobile-friendly control of the Rust trading engine via gRPC.

Why Telegram?

We needed a mobile interface without building a native app. Telegram provides:

  • Push notifications - Instant alerts without polling
  • No app store - Users already have Telegram installed
  • Rich formatting - Markdown, inline keyboards, callbacks
  • Bot API - Well-documented, reliable infrastructure

The trade-off: dependency on Telegram's platform. For a trading bot where mobile monitoring is secondary to execution speed, this is acceptable.

Architecture Overview

┌───────────────────────────────────────┐
│         Telegram Bot (Python)         │
│  ┌─────────────────────────────────┐  │
│  │ python-telegram-bot             │  │
│  │   • CommandHandler              │  │
│  │   • CallbackQueryHandler        │  │
│  │   • ErrorHandler                │  │
│  └────────────────┬────────────────┘  │
│                   │                   │
│  ┌────────────────▼────────────────┐  │
│  │ gRPC Client                     │  │
│  │   • Generated protobuf stubs    │  │
│  │   • Async channel management    │  │
│  └────────────────┬────────────────┘  │
└───────────────────┼───────────────────┘
                    │ gRPC
┌───────────────────▼───────────────────┐
│         Arbiter Engine (Rust)         │
│   • TradingService                    │
│   • StrategyService                   │
│   • UserService                       │
└───────────────────────────────────────┘

Handler Pattern

Every command follows the same pattern:

async def positions_handler(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    """Handle the /positions command."""
    if not update.message:
        return

    # 1. Get gRPC client from context
    client: ArbiterClient | None = context.bot_data.get("arbiter_client")
    if not client:
        await update.message.reply_text("Error: Backend not connected.")
        return

    try:
        # 2. Call backend via gRPC
        positions = await client.get_positions()

        # 3. Format response
        if not positions:
            await update.message.reply_text("No open positions.", parse_mode="Markdown")
            return

        message = format_positions(positions)
        await update.message.reply_text(message, parse_mode="Markdown")

    except ArbiterClientError as e:
        # 4. Handle errors gracefully
        logger.error("Failed to fetch positions", error=str(e))
        await update.message.reply_text(f"Error: {e}")

Key aspects:

  1. Early return on missing message - Handles edge cases
  2. Client from context - Shared connection, initialized once
  3. Async gRPC calls - Non-blocking communication
  4. Error boundaries - Never crash the handler

gRPC Client Wrapper

The raw generated stubs are wrapped in a client class:

```python
class ArbiterClient:
    """Async gRPC client for Arbiter trading engine."""

    def __init__(self, address: str):
        self.address = address
        self._channel: grpc.aio.Channel | None = None
        self._trading_stub: TradingServiceStub | None = None
        self._strategy_stub: StrategyServiceStub | None = None

    async def connect(self) -> None:
        """Establish gRPC channel."""
        self._channel = grpc.aio.insecure_channel(self.address)
        self._trading_stub = TradingServiceStub(self._channel)
        self._strategy_stub = StrategyServiceStub(self._channel)

    async def get_positions(self) -> list[Position]:
        """Fetch all open positions."""
        if not self._trading_stub:
            raise ArbiterClientError("Not connected")

        try:
            response = await self._trading_stub.GetPositions(PositionsRequest())
            return [self._convert_position(p) for p in response.positions]
        except grpc.aio.AioRpcError as e:
            raise ArbiterClientError(f"gRPC error: {e.code()}") from e
```

Benefits:

  • Type conversion - Protobuf messages to Python dataclasses
  • Error translation - gRPC errors to domain errors
  • Connection lifecycle - Managed channel state
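The type-conversion step is simple but worth seeing concretely. A sketch of what `_convert_position` might look like; the `Position` fields match the test fixtures shown later, while everything else here is illustrative rather than the actual protobuf schema:

```python
from dataclasses import dataclass


@dataclass
class Position:
    """Domain model, decoupled from the generated protobuf types."""
    market_id: str
    side: str
    size: float
    pnl: float


def convert_position(proto) -> Position:
    """Map a protobuf Position message onto the domain dataclass."""
    return Position(
        market_id=proto.market_id,
        side=proto.side,
        size=proto.size,
        pnl=proto.pnl,
    )
```

Because handlers only ever see `Position`, regenerating the protobuf stubs can't silently change the types the rest of the bot depends on.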

Inline Keyboards

Interactive buttons provide quick actions without typing commands:

```python
async def home_handler(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    """Dashboard with quick action buttons."""
    # ... fetch data ...

    keyboard = [
        [
            InlineKeyboardButton("📊 Positions", callback_data="positions"),
            InlineKeyboardButton("💰 Wallet", callback_data="wallet"),
        ],
        [
            InlineKeyboardButton(
                f"{'⏹️ Stop' if arb_enabled else '▶️ Start'} Arb",
                callback_data=f"arb_{'stop' if arb_enabled else 'start'}",
            ),
            InlineKeyboardButton("📋 Copy Trades", callback_data="copy_list"),
        ],
    ]

    await update.message.reply_text(
        message,
        parse_mode="Markdown",
        reply_markup=InlineKeyboardMarkup(keyboard),
    )
```

Callback routing handles button presses:

```python
async def callback_query_handler(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    """Route inline keyboard callbacks."""
    query = update.callback_query
    await query.answer()  # Acknowledge the callback

    # Shared client stored in bot_data during startup
    client: ArbiterClient = context.bot_data["arbiter_client"]

    match query.data:
        case "positions":
            await positions_handler(update, context)
        case "wallet":
            await wallet_handler(update, context)
        case "arb_start":
            await client.set_arb_enabled(True)
            await query.message.reply_text("🟢 Arbitrage started!")
        case "arb_stop":
            await client.set_arb_enabled(False)
            await query.message.reply_text("🔴 Arbitrage stopped!")
```

Subcommand Parsing

Commands with subcommands (like /arb start) use argument parsing:

```python
async def arb_handler(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    """Route /arb subcommands."""
    args = context.args or []

    match args:
        case [] | ["status"]:
            await show_arb_status(update, context)
        case ["start"]:
            await arb_start_handler(update, context)
        case ["stop"]:
            await arb_stop_handler(update, context)
        case _:
            await update.message.reply_text(
                "*Arbitrage Commands:*\n"
                "/arb status - Show status\n"
                "/arb start - Enable engine\n"
                "/arb stop - Disable engine",
                parse_mode="Markdown",
            )
```

For commands with parameters:

```python
async def copy_add_handler(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    """Handle /copy add <wallet> [allocation%] [max_position]."""
    args = context.args or []

    if len(args) < 1:
        await update.message.reply_text("Usage: /copy add <wallet> [alloc%] [max$]")
        return

    wallet_address = args[0]
    try:
        allocation = float(args[1]) if len(args) >= 2 else 10.0
        max_position = float(args[2]) if len(args) >= 3 else 100.0
    except ValueError:
        await update.message.reply_text("Allocation and max position must be numbers")
        return

    # Validate inputs
    if not 0 < allocation <= 100:
        await update.message.reply_text("Allocation must be 0-100%")
        return

    # Execute
    await client.add_copy_trade(wallet_address, allocation, max_position)
```

Application Lifecycle

The bot initializes the gRPC client during startup:

```python
async def post_init(application: Application) -> None:
    """Initialize after application starts."""
    settings = get_settings()

    client = ArbiterClient(settings.grpc_address)
    await client.connect()

    application.bot_data["arbiter_client"] = client
    logger.info("Bot initialized", grpc_address=settings.grpc_address)


async def post_shutdown(application: Application) -> None:
    """Clean up on shutdown."""
    client = application.bot_data.get("arbiter_client")
    if client:
        await client.close()


def create_application() -> Application:
    """Build the Telegram application."""
    settings = get_settings()

    return (
        Application.builder()
        .token(settings.telegram_bot_token)
        .post_init(post_init)
        .post_shutdown(post_shutdown)
        .build()
    )
```

Configuration with Pydantic

Settings load from environment variables with validation:

```python
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    telegram_bot_token: str
    telegram_allowed_users: list[int] = []  # Empty = allow all

    grpc_host: str = "localhost"
    grpc_port: int = 50051

    log_level: str = "INFO"

    model_config = {"env_file": ".env"}

    @property
    def grpc_address(self) -> str:
        return f"{self.grpc_host}:{self.grpc_port}"
```

Pydantic provides:

  • Type coercion - Strings to ints, comma-separated to lists
  • Validation - Fail fast on invalid config
  • Defaults - Sensible fallbacks
  • Documentation - Self-describing fields

Testing Strategy

Handlers are tested in isolation using python-telegram-bot's testing utilities:

```python
import pytest
from unittest.mock import AsyncMock, MagicMock


@pytest.fixture
def mock_client():
    """Create mock gRPC client."""
    client = AsyncMock(spec=ArbiterClient)
    client.get_positions.return_value = [
        Position(market_id="BTC-50K", side="long", size=100, pnl=50.0)
    ]
    return client


@pytest.mark.asyncio
async def test_positions_handler_shows_positions(mock_client):
    """Test /positions command displays positions."""
    update = MagicMock()
    update.message.reply_text = AsyncMock()

    context = MagicMock()
    context.bot_data = {"arbiter_client": mock_client}

    await positions_handler(update, context)

    # Verify gRPC call
    mock_client.get_positions.assert_called_once()

    # Verify response
    call_args = update.message.reply_text.call_args
    assert "BTC-50K" in call_args[0][0]
    assert "$50.00" in call_args[0][0]
```

Test coverage includes:

| Category | Tests |
|---|---|
| Command handlers | 35 |
| Callback handlers | 15 |
| Configuration | 5 |
| Error handling | 5 |
| **Total** | **60** |

Lessons Learned

  1. Context is your friend - Store shared resources in bot_data
  2. Async all the way - Don't block the event loop
  3. Error boundaries - Handle every gRPC failure gracefully
  4. Markdown escaping - User input can break formatting
  5. Callback data limits - 64 bytes max, use IDs not full data
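Lesson 4 deserves an example: a market symbol or wallet label containing `_` or `*` will break Telegram's legacy Markdown parser mid-message. A minimal escaping helper (a sketch for the legacy `Markdown` parse mode; `MarkdownV2` requires escaping a larger character set):

```python
def escape_markdown(text: str) -> str:
    """Escape characters Telegram's legacy Markdown mode treats as formatting."""
    for ch in ("_", "*", "`", "["):
        text = text.replace(ch, f"\\{ch}")
    return text
```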

Future Improvements

  • Conversation handlers - Multi-step wizards for complex actions
  • Push notifications - Alert on significant P&L changes
  • Rate limiting - Per-user command throttling
  • Localization - Multi-language support

References

Building Template Management Tooling: ADR-007

· 4 min read
Amiable Dev
Project Contributors

How we built a CLI tool and Claude Code skill to manage our template registry with three levels of validation.

The Problem

ADR-003 gave us a declarative template registry (templates.yaml), but managing it was painful:

  1. Error-prone: Nested YAML structures are easy to mess up
  2. Undiscoverable: New contributors didn't know required fields
  3. No feedback: Errors only surfaced during CI builds
  4. Manual validation: Run JSON Schema checks by hand

We needed tooling for both humans and LLMs to manage templates reliably.

The Solution: Hybrid Approach

We evaluated four options:

| Option | Verdict |
|---|---|
| Claude Code Skills only | Limited to Claude Code users |
| MCP Server | Overkill for 3 templates |
| Makefile only | No guided prompts |
| Hybrid (Skills + Makefile + CLI) | Best of all worlds |

The hybrid approach uses a single Python CLI as the canonical implementation, with both Skills and Makefile as interfaces.

The CLI: template_manager.py

All operations go through one entry point:

# Validation
python scripts/template_manager.py validate
python scripts/template_manager.py validate --deep # Network checks

# List templates
python scripts/template_manager.py list
python scripts/template_manager.py list --category observability --format json

# CRUD operations
python scripts/template_manager.py add --id my-template --repo owner/repo ...
python scripts/template_manager.py update my-template --tier production
python scripts/template_manager.py remove old-template

Why One CLI?

  • Single source of truth: Skills and Makefile both call the same code
  • Testable: 54 unit tests cover all operations
  • Consistent: Same validation logic everywhere
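The subcommand surface shown above maps naturally onto argparse. A simplified sketch of the parser layout (only two subcommands shown, and the handler wiring is omitted):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Assemble the template_manager.py subcommand parser."""
    parser = argparse.ArgumentParser(prog="template_manager.py")
    sub = parser.add_subparsers(dest="command", required=True)

    validate = sub.add_parser("validate", help="Run schema + semantic checks")
    validate.add_argument("--deep", action="store_true", help="Also run network checks")

    list_cmd = sub.add_parser("list", help="List registered templates")
    list_cmd.add_argument("--category", help="Filter by category")
    list_cmd.add_argument("--format", choices=["table", "json"], default="table")

    return parser
```

Each subcommand then dispatches to the same functions the Skill and Makefile ultimately exercise.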

Three Levels of Validation

Not all validation is equal. We separated checks by speed and importance:

| Level | When | What | Blocking? |
|---|---|---|---|
| Level 1: Schema | Always | JSON Schema conformance, types, required fields | Yes |
| Level 2: Semantic | Always | Unique IDs, valid category refs, HTTPS URLs | Yes |
| Level 3: Network | `--deep` only | URL reachability, GitHub repo existence | No (warning) |

Level 3 is opt-in because network checks are slow and external services can be flaky:

$ python scripts/template_manager.py validate --deep

Network warnings:
- Template 'litellm-langfuse-starter' links.railway_template not found (404)
Validation passed: templates.yaml

The template is still valid—we just warn about the broken link.
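Level 2 checks are plain Python over the parsed config. A simplified sketch of the semantic pass (the real CLI validates more fields, such as category references):

```python
def semantic_validate(templates: list[dict]) -> list[str]:
    """Return semantic errors for parsed template entries."""
    errors = []
    seen_ids = set()
    for t in templates:
        # Unique IDs (JSON Schema can't express this across items)
        if t["id"] in seen_ids:
            errors.append(f"duplicate template id: {t['id']}")
        seen_ids.add(t["id"])
        # All outbound links must be HTTPS
        for name, url in t.get("links", {}).items():
            if not url.startswith("https://"):
                errors.append(f"{t['id']}: link '{name}' must use HTTPS")
    return errors
```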

Claude Code Skill Integration

For LLM-assisted workflows, we created a skill at .claude/skills/template-registry/:

template-registry/
├── SKILL.md # Main instructions (safety rules, CLI commands)
├── schema-reference.md # Field documentation
└── examples.md # Common patterns

The skill teaches Claude to use the CLI safely:

## Important Safety Rules
1. ALWAYS run validation before any write operation
2. NEVER commit directly to main - create a branch/PR
3. Treat all LLM outputs as untrusted until validated

Now you can ask Claude: "Add a new template for my-awesome-project" and it will:

  1. Use the CLI with proper arguments
  2. Run validation
  3. Create a branch and PR

Security Hardening

LLM-assisted editing introduces risks. We added multiple protections:

Input Validation

# Reject malformed GitHub owner/repo names
GITHUB_OWNER_PATTERN = re.compile(r"^[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,37}[a-zA-Z0-9])?$")
GITHUB_REPO_PATTERN = re.compile(r"^[a-zA-Z0-9._-]{1,100}$")
# Reject symlinks to prevent LFI attacks
fd = os.open(str(path), os.O_RDONLY | os.O_NOFOLLOW)

YAML Hardening

yaml.allow_duplicate_keys = False  # Catch accidental overwrites

Atomic Writes

# Write to temp file, then atomic rename
fd, temp_path = tempfile.mkstemp(dir=path.parent)
# ... write content ...
os.replace(temp_path, path) # Atomic on POSIX
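Fleshed out into a reusable helper, the pattern looks like this (a sketch; the production script also preserves file permissions):

```python
import os
import tempfile
from pathlib import Path


def atomic_write(path: Path, content: str) -> None:
    """Write via temp file + atomic rename so readers never see a partial file."""
    fd, temp_path = tempfile.mkstemp(dir=path.parent)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
        os.replace(temp_path, path)  # Atomic on POSIX
    except BaseException:
        os.unlink(temp_path)  # Don't leave temp debris behind
        raise
```

If the process dies mid-write, `templates.yaml` still holds its previous, valid contents.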

Makefile Integration

For automation and CI, everything is available via make:

make validate        # Level 1 + 2
make validate-deep # Level 1 + 2 + 3
make templates # List all
make templates-json # JSON output
make help # Show all targets

The build target runs validation first:

```makefile
build: validate
	python scripts/aggregate_templates.py
	mkdocs build --strict
```

Pre-commit Hook

Validation runs automatically before commits:

```yaml
# .pre-commit-config.yaml
- repo: local
  hooks:
    - id: template-manager-validate
      name: Validate templates.yaml (semantic)
      language: system  # Local hooks must declare a language
      entry: python scripts/template_manager.py validate
      files: ^templates\.yaml$
```

Now invalid templates can't even be committed locally.

What We Learned

  1. One CLI, many interfaces: Skills and Makefile are just wrappers
  2. Tiered validation saves time: Fast checks always, slow checks on demand
  3. LLMs need guardrails: Validation-first prevents hallucinated YAML
  4. Atomic operations matter: Temp file + rename prevents corruption

Implementation Stats

  • 4 phases over 2 days
  • 54 tests with full coverage
  • 1,053 lines of Python
  • 16 GitHub issues tracked and closed

What's Next

  • MCP Server: Reconsider at 20+ templates (current: 3)
  • Template Linting: Check for common misconfigurations
  • Auto-sync: Fetch metadata from Railway API

Links:

Building an OSS Foundation: ADR-001 Implementation

· 3 min read
Amiable Dev
Project Contributors

How we established community standards for amiable-templates using Architecture Decision Records (ADRs) and multi-model AI review.

!!! info "What's an ADR?"

    An Architecture Decision Record documents significant technical decisions with context, options considered, and rationale. It creates a searchable history of why things are the way they are.

The Problem

We're building amiable-templates to aggregate deployment templates for AI infrastructure into a single portal. Before writing any aggregation code, we needed to answer: How do we structure an OSS project that invites contribution?

Starting from scratch means making a lot of decisions:

  • What license?
  • How do contributors know what's expected?
  • How do we handle security reports?
  • What governance model fits a small project?

The Solution: Adopt Proven Patterns

Instead of reinventing the wheel, we borrowed from the existing OSS foundation ADR-033, which had already been reviewed by the LLM Council and battle-tested on llm-council.dev.

The Files

| File | Purpose |
|---|---|
| LICENSE | MIT - maximum flexibility |
| CODE_OF_CONDUCT.md | Contributor Covenant v2.1 |
| CONTRIBUTING.md | How to contribute |
| SECURITY.md | 48hr target response time |
| GOVERNANCE.md | Decision-making process |
| SUPPORT.md | Where to get help |

GitHub Configuration

.github/
├── CODEOWNERS # Auto-assign reviewers
├── dependabot.yml # Keep deps updated
├── ISSUE_TEMPLATE/ # Structured bug reports
└── PULL_REQUEST_TEMPLATE.md

Example: CODEOWNERS

Here's how we route reviews to the right people:

# Default: maintainers review everything
* @amiable-dev/maintainers

# Critical config requires explicit maintainer approval
templates.yaml @amiable-dev/maintainers
mkdocs.yml @amiable-dev/maintainers

# CI/CD changes are sensitive
.github/ @amiable-dev/maintainers

# ADRs need architectural review
docs/adrs/ @amiable-dev/maintainers

This means any PR touching templates.yaml (our template registry) automatically requests review from maintainers. As the project grows, we can split ownership - e.g., docs/ @docs-team.

The Interesting Part: LLM Council Review

We used LLM Council to review our ADR before accepting it. LLM Council is an MCP server that queries multiple AI models in parallel, has them critique each other's responses, and synthesizes a consensus verdict.

Four models (GPT-5.2, Claude Opus 4.5, Gemini 3 Pro, Grok 4.1) reviewed our draft ADR:

What they caught:

| Finding | Our Response |
|---|---|
| Missing CI/CD workflows | Added deploy.yml and security.yml |
| GOVERNANCE.md premature for solo project | Simplified; will expand at 3+ maintainers |
| Need template intake policy | Added to CONTRIBUTING.md |

The full review is documented in ADR-001.

Tracking It All

We used GitHub Issues to track implementation:

  • Epic: #5 - Complete OSS Foundation
  • Sub-issues: Labels (#6), Branch Protection (#7), Blog (#8), etc.

This gives visibility into what's done and what's remaining.

What's Next

With the foundation in place, we're moving through the remaining ADRs:

  • ADR-002: MkDocs site architecture
  • ADR-003: Template configuration system
  • ADR-004: CI/CD & deployment
  • ADR-005: DevSecOps implementation
  • ADR-006: Cross-project documentation aggregation

Each follows the same process: draft, LLM Council review, implement, document.


Links:

Choosing MkDocs Material: ADR-002 Site Architecture

· 4 min read
Amiable Dev
Project Contributors

Why we chose MkDocs Material over Docusaurus or a custom solution, and how we structured the site.

The Problem

We needed a documentation site that could showcase templates in a scannable, attractive format. The site also needed to aggregate documentation from multiple template repositories, provide excellent search, support dark/light mode for accessibility, and be easy for contributors to work with.

Three options emerged: MkDocs Material, Docusaurus, or a custom Next.js/Astro site.

Why MkDocs Material?

| Criteria | MkDocs Material | Docusaurus | Custom |
|---|---|---|---|
| Stack alignment | Python (matches scripts) | React/Node | Varies |
| Setup time | Hours | Hours | Days/Weeks |
| Maintenance | Low | Medium | High |
| Search | Built-in (lunr.js) | Algolia needed | Build it |
| Dark mode | Built-in | Built-in | Build it |

Why not Docusaurus? It's a great framework, and we use it for our own blog at amiable.dev, but it would introduce React/Node into a Python-focused project. Our aggregation scripts are Python, and a consistent stack reduces cognitive load.

Why not custom? A Next.js or Astro site would give us full control, but it's overkill for documentation. We'd spend weeks building what MkDocs Material gives us out of the box.

The deciding factor: consistency. Our llm-council docs already use MkDocs Material. Same tooling, same patterns, same contributor experience.

The Architecture

docs/
├── index.md # Hero + featured templates
├── quickstart.md # Prominent, top-level
├── templates/ # Template grid + aggregated docs
├── adrs/ # Architecture decisions
├── blog/ # You're reading it
└── stylesheets/
└── extra.css # Hero + grid styling

We use top-level tabs for main sections:

```yaml
nav:
  - Home: index.md
  - Quick Start: quickstart.md
  - Templates: templates/index.md
  - ADRs: adrs/index.md
  - Contributing: contributing.md
  - Blog: blog/index.md
```

Quick Start gets its own tab because that's what most visitors want.

The Template Grid

We wanted a scannable grid of template cards without any JavaScript complexity. Here's how we built it using pure markdown with custom CSS:

<div class="template-grid" markdown="1">

<div class="template-card" markdown="1">

### LiteLLM + Langfuse Starter

Production-ready LLM proxy with observability.

**Features:**
- 100+ LLM providers via LiteLLM
- Request tracing with Langfuse
- Cost tracking and analytics

**Estimated Cost:** ~$29-68/month

[:octicons-rocket-16: Deploy](https://railway.app/template/...)
[:octicons-mark-github-16: Source](https://github.com/amiable-dev/litellm-langfuse-railway)

</div>

<div class="template-card" markdown="1">

### Another Template

Description here...

</div>

</div>

The markdown attribute is key—it tells MkDocs Material to process the markdown inside the HTML divs.

The CSS does the heavy lifting:

```css
.template-grid {
  display: grid;
  grid-template-columns: repeat(auto-fit, minmax(320px, 1fr));
  gap: 1.5rem;
}

.template-card {
  border: 1px solid var(--md-default-fg-color--lightest);
  border-radius: 0.5rem;
  padding: 1.5rem;
}

.template-card:hover {
  box-shadow: 0 4px 12px rgba(0, 0, 0, 0.1);
  transform: translateY(-2px);
}
```

No framework needed. Pure CSS grid with markdown content.

Dark Mode

MkDocs Material handles dark mode elegantly with palette configuration. Users get automatic detection based on their system preference, plus a toggle to override:

```yaml
theme:
  name: material
  palette:
    - media: "(prefers-color-scheme: light)"
      scheme: default
      primary: deep purple
      accent: deep purple
      toggle:
        icon: material/brightness-7
        name: Switch to dark mode
    - media: "(prefers-color-scheme: dark)"
      scheme: slate
      primary: deep purple
      accent: deep purple
      toggle:
        icon: material/brightness-4
        name: Switch to light mode
```

The toggle icon appears in the header. No JavaScript to write, no state to manage—it just works.

What We Learned

  1. Start with constraints: "No JavaScript" forced simpler, more maintainable solutions
  2. Reuse organizational patterns: Same theme as llm-council = less cognitive load
  3. Put Quick Start first: Most visitors want to deploy, not read architecture docs

What's Next

  • ADR-003: Template configuration system (templates.yaml)
  • ADR-006: Cross-project documentation aggregation

Links:

Designing the Template Registry: ADR-003

· 4 min read
Amiable Dev
Project Contributors

How we built a declarative configuration system for Railway templates with JSON Schema validation.

The Problem

We needed a way to register templates without touching code. Every new template shouldn't require modifying Python scripts or HTML—just add an entry to a configuration file and the site rebuilds.

The configuration needed to:

  1. Define which templates appear on the site
  2. Specify where to find documentation in each repo
  3. Provide metadata for the template grid (features, cost, tags)
  4. Catch errors before they reach production

Why YAML?

We considered three formats:

| Format | Pros | Cons |
|---|---|---|
| YAML | Human-readable, comments allowed, matches mkdocs.yml | Syntax can be tricky |
| TOML | Matches Railway's railway.toml | Less common in Python ecosystem |
| JSON | Strict, universal | No comments, verbose |

YAML won because:

  1. Consistency: Our mkdocs.yml is already YAML
  2. Comments: We can document inline why certain fields exist
  3. Readability: Non-engineers can understand and edit it

The Schema

Here's what a template entry looks like:

```yaml
templates:
  - id: litellm-langfuse-starter
    repo:
      owner: "amiable-dev"
      name: "litellm-langfuse-railway"
    title: "LiteLLM + Langfuse Starter"
    description: "Production-ready LLM gateway with observability"
    category: observability
    tags:
      - litellm
      - langfuse
    directories:
      docs:
        - path: "starter/README.md"
          target: "overview.md"
    links:
      railway_template: "https://railway.app/template/..."
      github: "https://github.com/amiable-dev/litellm-langfuse-railway"
    features:
      - "100+ LLM Providers"
      - "Cost Tracking"
    estimated_cost:
      min: 29
      max: 68
      currency: "USD"
      period: "month"
```

Required fields: id, repo, title, description, category, directories.docs

Everything else is optional.

Validation with JSON Schema

YAML is flexible, which means it's easy to make mistakes. We use JSON Schema to catch errors early:

```yaml
# templates.schema.yaml
properties:
  templates:
    items:
      type: object
      additionalProperties: false  # Catch typos!
      required:
        - id
        - repo
        - title
        - description
        - category
        - directories
      properties:
        id:
          type: string
          pattern: "^[a-z][a-z0-9-]*$"
        # ...
```

The `additionalProperties: false` setting is important. Here's what happens with a typo:

```yaml
# Bad config - spot the typo
templates:
  - id: my-template
    repo:
      owner: "amiable-dev"
      name: "my-repo"
    title: "My Template"
    description: "A template"
    category: observability
    directories:
      docs:
        - path: "README.md"
          target: "overview.md"
    featurs:  # Typo!
      - "Feature 1"
```

With additionalProperties: false, the schema rejects this:

$.templates[0]: Additional properties are not allowed ('featurs' was unexpected)

Without it, the typo would silently pass—and the aggregation script would just skip the field.
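You can see the mechanism with plain set arithmetic, no schema library required (the allowed-field list here is abridged):

```python
# Abridged version of the fields our schema declares
ALLOWED_FIELDS = {
    "id", "repo", "title", "description", "category",
    "directories", "tags", "links", "features", "estimated_cost",
}


def unexpected_fields(entry: dict) -> set[str]:
    """Return any keys a strict (additionalProperties: false) schema would reject."""
    return set(entry) - ALLOWED_FIELDS
```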

CI Integration

Validation runs on every PR:

```yaml
# .github/workflows/validate.yml
- name: Validate templates.yaml schema
  run: check-jsonschema --schemafile templates.schema.yaml templates.yaml

- name: Validate unique template IDs
  run: |
    python -c "
    import yaml
    with open('templates.yaml') as f:
        config = yaml.safe_load(f)
    ids = [t['id'] for t in config.get('templates', [])]
    duplicates = [id for id in ids if ids.count(id) > 1]
    if duplicates:
        exit(1)
    "
```

Error messages are actionable:

$.templates[0]: 'repo' is a required property
$.templates[0].id: 'INVALID_ID' does not match '^[a-z][a-z0-9-]*$'

The Aggregation Flow

  1. CI validates templates.yaml against the JSON Schema
  2. Python aggregator reads the config (YAML → Python dict)
  3. For each template, fetches docs from directories.docs paths
  4. Transforms content (rewrites links, adds attribution)
  5. Writes to docs/templates/{id}/
  6. MkDocs builds the static site

The script uses .get() for optional fields, so missing features or estimated_cost don't break the build.
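That defensive `.get()` style looks like this in practice. A simplified sketch of one rendering step (the function name and output format are illustrative):

```python
def render_cost(template: dict) -> str:
    """Format the estimated-cost line, tolerating a missing optional field."""
    cost = template.get("estimated_cost")
    if not cost:
        return ""  # Omit the line entirely rather than crash the build
    return f"~${cost['min']}-{cost['max']}/{cost['period']}"
```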

What We Learned

  1. Strict schemas catch bugs early: additionalProperties: false is your friend
  2. Validate on PRs, not just deploys: Developers should see errors before merge
  3. JSON Schema can't do everything: We added a Python check for unique IDs

What's Next

  • ADR-006: Cross-project documentation aggregation (the script that reads this config)

Links: