
ADR-026: Dynamic Model Intelligence and Benchmark-Driven Selection

Status: APPROVED (Blocking Conditions Implemented)
Date: 2025-12-23
Decision Makers: Engineering, Architecture Council
Review: 2025-12-23 (Strategic + Technical Reviews)
Layer Assignment: Cross-cutting (L1-L4 integration)
Implementation: 2025-12-23 (Blocking Conditions 1-3)


⚠️ CRITICAL: Strategic Council Review - Vendor Dependency Risk

Verdict: CONDITIONAL APPROVAL

ADR-026 was not approved in its originally submitted form: the council identified critical vendor dependency risks that had to be addressed before implementation. The blocking conditions below have since been implemented.

"We cannot build the core 'brain' of an open-source project on proprietary APIs that we do not control." — Council Consensus

The "Sovereign Orchestrator" Philosophy

The council unanimously adopts this architectural principle:

The open-source version of LLM Council must function as a complete, independent utility. External services (like OpenRouter or Not Diamond) must be treated as PLUGINS, not foundations.

If the internet is disconnected or if an API key is revoked, the software must still boot, run, and perform its core function (orchestrating LLMs), even if quality is degraded.

Blocking Conditions for Approval

| # | Condition | Status | Priority |
|---|-----------|--------|----------|
| 1 | Add ModelMetadataProvider abstraction interface | ✅ COMPLETED | BLOCKING |
| 2 | Implement StaticRegistryProvider (30+ models) | ✅ COMPLETED (31 models) | BLOCKING |
| 3 | Add offline mode (LLM_COUNCIL_OFFLINE=true) | ✅ COMPLETED | BLOCKING |
| 4 | Evaluate LiteLLM as unified abstraction | ✅ COMPLETED (as fallback) | High |
| 5 | Document degraded vs. enhanced feature matrix | 📋 Required | Medium |

Implementation Notes (2025-12-23)

The blocking conditions were implemented using TDD (Test-Driven Development) with 86 passing tests.

Module Structure: src/llm_council/metadata/

| File | Purpose |
|------|---------|
| types.py | ModelInfo frozen dataclass, QualityTier enum, Modality enum |
| protocol.py | MetadataProvider @runtime_checkable Protocol |
| static_registry.py | StaticRegistryProvider with YAML + LiteLLM fallback |
| litellm_adapter.py | Lazy LiteLLM import for metadata extraction |
| offline.py | is_offline_mode() and check_offline_mode_startup() |
| __init__.py | get_provider() singleton factory, module exports |
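A minimal usage sketch of the module entry points named above. The exact signatures of get_provider() and is_offline_mode() are not spelled out in this ADR, so the calls below are illustrative:

# Illustrative usage only; the real signatures live in src/llm_council/metadata/.
from llm_council.metadata import get_provider
from llm_council.metadata.offline import is_offline_mode

provider = get_provider()  # singleton factory; static provider when offline

if is_offline_mode():
    print("Offline mode: bundled registry only, metadata may be stale")

info = provider.get_model_info("anthropic/claude-opus-4.5")
if info is not None:
    print(info.context_window, info.quality_tier)

print(provider.get_context_window("unknown/model"))  # -> 4096 safe default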

Bundled Registry: src/llm_council/models/registry.yaml

31 models from 8 providers:

  • OpenAI (7): gpt-4o, gpt-4o-mini, gpt-5.2-pro, o1, o1-preview, o1-mini, o3-mini
  • Anthropic (5): claude-opus-4.5, claude-3-5-sonnet, claude-3-5-haiku, claude-3-opus, claude-3-sonnet
  • Google (5): gemini-3-pro-preview, gemini-2.5-pro, gemini-2.0-flash, gemini-1.5-pro, gemini-1.5-flash
  • xAI (2): grok-4, grok-4.1-fast
  • DeepSeek (2): deepseek-r1, deepseek-chat
  • Meta (2): llama-3.3-70b, llama-3.1-405b
  • Mistral (2): mistral-large-2411, mistral-medium
  • Ollama (6): llama3.2, mistral, qwen2.5:14b, codellama, phi3, deepseek-r1:8b

LiteLLM Integration: Used as fallback in the priority chain (local registry > LiteLLM > 4096 default). Lazy import prevents startup failures when LiteLLM is not installed.

GitHub Issues: #89-#92 (all completed)

Strategic Decision: Option C+D (Hybrid + Abstraction)

| Feature | OSS (Self-Hosted) | Council Cloud (Commercial) |
|---------|-------------------|----------------------------|
| Model Metadata | Static library (LiteLLM) + manual YAML config | Real-time dynamic sync via OpenRouter |
| Routing | Heuristic rules (latency/cost-based) | Intelligent ML-based (Not Diamond) |
| Integrations | BYOK (Bring Your Own Keys) | Managed fleet (one bill, instant access) |
| Operations | localhost / individual instance | Team governance, analytics, SSO |

Vendor Dependency Analysis

| Service | Current Role | Risk Level | Required Mitigation |
|---------|--------------|------------|---------------------|
| OpenRouter | Metadata API, gateway | HIGH | Static fallback + LiteLLM |
| Not Diamond | Model routing, classification | MEDIUM | Heuristic fallback (exists) |
| Requesty | Alternative gateway | LOW | Already optional |

Affiliate/Reseller Model: NOT VIABLE

"Reliance on affiliate revenue or tight coupling creates Platform Risk. If OpenRouter releases 'OpenRouter Agents,' Council becomes obsolete instantly. Furthermore, council-cloud cannot withstand margin compression." — Council

Decision: Use external services to lower the User's barrier to entry, not as the backbone of the Product's value.


Required Abstraction Architecture

MetadataProvider Interface (MANDATORY)

from pathlib import Path
from typing import Protocol, Optional, Dict, List
from dataclasses import dataclass

@dataclass
class ModelInfo:
    id: str
    context_window: int
    pricing: Dict[str, float]        # {"prompt": 0.01, "completion": 0.03}
    supported_parameters: List[str]
    modalities: List[str]
    quality_tier: str                # "frontier" | "standard" | "economy"

class MetadataProvider(Protocol):
    """Abstract interface for model metadata sources."""

    def get_model_info(self, model_id: str) -> Optional[ModelInfo]: ...
    def get_context_window(self, model_id: str) -> int: ...
    def get_pricing(self, model_id: str) -> Dict[str, float]: ...
    def supports_reasoning(self, model_id: str) -> bool: ...
    def list_available_models(self) -> List[str]: ...

class StaticRegistryProvider(MetadataProvider):
    """Default: Offline-safe provider using bundled registry + LiteLLM."""

    def __init__(self, registry_path: Optional[Path] = None):
        self.registry = self._load_registry(registry_path)
        self.litellm_data = self._load_litellm_model_map()

    def get_context_window(self, model_id: str) -> int:
        # 1. Check local config override
        if model_id in self.registry:
            return self.registry[model_id].context_window
        # 2. Check LiteLLM library
        if model_id in self.litellm_data:
            return self.litellm_data[model_id].context_window
        # 3. Safe default
        return 4096

class DynamicMetadataProvider(MetadataProvider):
    """Optional: Real-time metadata from OpenRouter API."""

    async def refresh(self) -> None:
        """Fetch latest model data - requires API key."""
        ...

Static Registry Schema (MANDATORY)

# models/registry.yaml - Shipped with OSS
version: "1.0"
updated: "2025-12-23"
models:
  - id: "openai/gpt-4o"
    context_window: 128000
    pricing:
      prompt: 0.0025
      completion: 0.01
    supported_parameters: ["temperature", "top_p", "tools"]
    modalities: ["text", "vision"]
    quality_tier: "frontier"

  - id: "anthropic/claude-opus-4.5"
    context_window: 200000
    pricing:
      prompt: 0.015
      completion: 0.075
    supported_parameters: ["temperature", "top_p", "tools", "reasoning"]
    modalities: ["text", "vision"]
    quality_tier: "frontier"

  - id: "ollama/llama3.2"
    provider: "ollama"
    context_window: 128000
    pricing:
      prompt: 0
      completion: 0
    modalities: ["text"]
    quality_tier: "local"

Offline Mode (MANDATORY)

# Force offline operation - MUST work without any external calls
export LLM_COUNCIL_OFFLINE=true

When offline mode is enabled:

  1. Use StaticRegistryProvider exclusively
  2. Disable all external metadata/routing calls
  3. Log INFO message about limited/stale metadata
  4. All core council operations MUST succeed
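A minimal sketch of how the offline switch could gate provider selection, assuming the helper names from the module table above; the bodies are illustrative, not the shipped implementation:

# Illustrative only: offline check and provider selection honoring LLM_COUNCIL_OFFLINE.
import logging
import os

logger = logging.getLogger(__name__)

def is_offline_mode() -> bool:
    """True when LLM_COUNCIL_OFFLINE is set to a truthy value."""
    return os.getenv("LLM_COUNCIL_OFFLINE", "").lower() in {"1", "true", "yes"}

def get_provider():
    """Return a metadata provider, using only the bundled registry when offline."""
    if is_offline_mode():
        logger.info("Offline mode: using StaticRegistryProvider; metadata may be stale.")
        return StaticRegistryProvider()
    return DynamicMetadataProvider()  # external calls allowed; may still fall back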

Technical Council Review Summary

Technical Review (2025-12-23) - Full Quorum

| Model | Verdict | Rank | Response Time |
|-------|---------|------|---------------|
| Claude Opus 4.5 | CONDITIONAL APPROVAL | #1 | 23.4s |
| Gemini 3 Pro | APPROVE | #2 | 31.4s |
| Grok 4 | APPROVE | #3 | 59.6s |
| GPT-4o | APPROVE | #4 | 9.8s |

"The council successfully identified Response C (Claude) as the superior review, noting its crucial detection of mathematical flaws (Borda normalization with variable pool sizes) and logical gaps (Cold Start) missed by other responses."

First Technical Review (2025-12-23, 3/4 models)

Approved Components:

  • Dynamic metadata integration via OpenRouter API (pricing, availability, capability detection)
  • Reasoning parameter optimization (reasoning_effort, budget_tokens)
  • Integration points with existing L1-L4 architecture

Returned for Revision (Now Resolved):

  • Benchmark scraping strategy → Deferred to Phase 4, use Internal Performance Tracker
  • Single scoring algorithm with "magic number" weights → Tier-Specific Weighting Matrices

Key Technical Recommendations

| Recommendation | Status | Priority |
|----------------|--------|----------|
| Add Context Window as hard constraint | ✅ Incorporated | Critical |
| Replace single scoring with Tier-Specific Weighting | ✅ Incorporated | High |
| Defer benchmark scraping to optional Phase 4 | ✅ Incorporated | High |
| Add Anti-Herding logic | ✅ Incorporated | Medium |
| Implement Internal Performance Tracker | ✅ Incorporated | Medium |
| Cold Start handling for new models | 📋 Documented | Medium |
| Borda score normalization | 📋 Documented | Medium |
| Anti-Herding edge case (<3 models) | 📋 Documented | Low |

Council Consensus Points

  1. Context Window is a hard pass/fail constraint - must filter before scoring, not weight
  2. Tier-specific weighting is essential - quick tier prioritizes speed, reasoning tier prioritizes quality
  3. Benchmark scraping is high-risk - external APIs change frequently, creates maintenance nightmare
  4. Internal performance data is more valuable - track actual council session outcomes
  5. Phased approach required - decouple metadata (proven value) from benchmark intelligence (speculative)
  6. Cold Start needs exploration strategy - new models need "audition" mechanism (Phase 3)
  7. LiteLLM strongly recommended - use as library for metadata, not just proxy

Context

Problem Statement

The LLM Council's current model selection relies on static configuration that quickly goes stale in a rapidly evolving model landscape. November and December 2025 alone saw major releases from every frontier lab:

| Release Date | Model | Provider |
|--------------|-------|----------|
| Nov 17, 2025 | Grok 4.1 | xAI |
| Nov 18, 2025 | Gemini 3 Pro | Google |
| Nov 24, 2025 | Claude Opus 4.5 | Anthropic |
| Dec 11, 2025 | GPT-5.2 | OpenAI |

Our tier pools in config.py reference models that may be:

  • Deprecated or renamed (model identifiers change)
  • Outperformed by newer models (benchmarks shift monthly)
  • Suboptimally configured (missing reasoning parameters)
  • Unavailable or rate-limited (provider status changes)

Current Architecture Gaps

| Gap | Impact | Current State |
|-----|--------|---------------|
| Static tier pools | Stale model selection | Hardcoded in config.py |
| No benchmark integration | Suboptimal model-task matching | Manual updates |
| No model metadata | Missing capabilities detection | Assumed uniform |
| No reasoning parameters | Underutilized model capabilities | Default parameters only |
| No availability tracking | Failures on unavailable models | Reactive error handling |

Existing Foundation (ADRs 020, 022, 024)

The architecture already supports dynamic model selection:

| ADR | Component | Opportunity |
|-----|-----------|-------------|
| ADR-020 | Not Diamond integration | Model routing API exists but uses static candidates |
| ADR-022 | Tier contracts | allowed_models field could be dynamically populated |
| ADR-024 | Layer architecture | L1 tier selection could query external data sources |

Decision

Implement a Model Intelligence Layer that provides real-time model metadata, benchmark rankings, and dynamic pool management to all routing layers.

Architecture Overview

┌────────────────────────────────────────────────────────────────────┐
│                   MODEL INTELLIGENCE LAYER (New)                   │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│ ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐    │
│ │ Model Registry  │   │ Benchmark Index │   │  Availability   │    │
│ │ (OpenRouter API)│   │ (Leaderboards)  │   │    Monitor      │    │
│ └────────┬────────┘   └────────┬────────┘   └────────┬────────┘    │
│          │                     │                     │             │
│          └─────────────────────┴─────────────────────┘             │
│                                │                                   │
│                    ┌───────────▼───────────┐                       │
│                    │  Model Selector API   │                       │
│                    │  - get_tier_models()  │                       │
│                    │  - get_best_for_task()│                       │
│                    │  - get_model_params() │                       │
│                    └───────────┬───────────┘                       │
│                                │                                   │
└────────────────────────────────┼───────────────────────────────────┘
                                 │
         ┌───────────────────────┼───────────────────────┐
         │                       │                       │
         ▼                       ▼                       ▼
 ┌───────────────┐       ┌───────────────┐       ┌───────────────┐
 │  L1: Tier     │       │  L2: Query    │       │  L4: Gateway  │
 │  Selection    │       │  Triage       │       │  Routing      │
 │  (ADR-022)    │       │  (ADR-020)    │       │  (ADR-023)    │
 └───────────────┘       └───────────────┘       └───────────────┘

Data Sources

1. OpenRouter Models API

Endpoint: GET https://openrouter.ai/api/v1/models

Provides real-time model metadata:

{
  "id": "anthropic/claude-opus-4-5-20250514",
  "name": "Claude Opus 4.5",
  "pricing": {
    "prompt": "0.000015",
    "completion": "0.000075"
  },
  "context_length": 200000,
  "architecture": {
    "input_modalities": ["text", "image"],
    "output_modalities": ["text"]
  },
  "supported_parameters": ["temperature", "top_p", "reasoning"],
  "top_provider": {
    "is_moderated": true
  }
}

Key Fields for Selection:

  • pricing - Cost optimization
  • context_length - Long document handling
  • supported_parameters - Reasoning mode detection
  • input_modalities - Multimodal capability
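For illustration, the payload above maps naturally onto the ModelInfo dataclass defined earlier; the ModelInfo.from_api() helper used in the caching code later in this document presumably does something similar. The function below is a sketch, and its tier heuristic is a placeholder:

# Illustrative mapping from the OpenRouter /models payload to ModelInfo (defined earlier).
def model_info_from_api(payload: dict) -> ModelInfo:
    pricing = payload.get("pricing", {})
    arch = payload.get("architecture", {})
    return ModelInfo(
        id=payload["id"],
        context_window=payload.get("context_length", 4096),
        pricing={
            "prompt": float(pricing.get("prompt", 0)),
            "completion": float(pricing.get("completion", 0)),
        },
        supported_parameters=payload.get("supported_parameters", []),
        modalities=arch.get("input_modalities", ["text"]),
        # Placeholder heuristic; the real tier assignment is curated in registry.yaml
        quality_tier="frontier" if payload.get("context_length", 0) >= 200_000 else "standard",
    )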

2. Benchmark Leaderboards

| Source | Data | Update Frequency | API |
|--------|------|------------------|-----|
| LMArena | Elo ratings from 5M+ votes | Real-time | Public |
| LiveBench | Monthly contamination-free benchmarks | Monthly | Public |
| Artificial Analysis | Speed, cost, quality metrics | Weekly | Public |
| LLM Stats | Aggregated performance data | Daily | Public |

Benchmark Categories:

  • Reasoning: GPQA Diamond, AIME 2025, ARC-AGI-2
  • Coding: SWE-bench, LiveCodeBench, Terminal-Bench
  • General: MMLU-Pro, Humanity's Last Exam
  • Speed: Tokens/second, time-to-first-token

3. OpenRouter Rankings

Endpoint: GET https://openrouter.ai/rankings

Usage-based popularity metrics (tokens served, request count).


Model Parameter Optimization

Reasoning Mode Parameters

OpenRouter supports unified reasoning parameters:

# For reasoning-capable models (o1, o3, GPT-5, Claude with thinking)
request_params = {
    "reasoning": {
        "effort": "high",       # "minimal" | "low" | "medium" | "high" | "xhigh"
        "max_tokens": 32000,    # Budget for reasoning tokens
        "exclude": False,       # Include reasoning in response
    }
}

Effort Level Budget Calculation:

budget_tokens = max(min(max_tokens * effort_ratio, 32000), 1024)

effort_ratio:
  xhigh:   0.95
  high:    0.80
  medium:  0.50
  low:     0.20
  minimal: 0.10
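The same calculation in Python, purely to make the clamping bounds explicit (the ratio table mirrors the values above):

# Budget per the formula above, clamped to [1024, 32000] reasoning tokens.
EFFORT_RATIOS = {"xhigh": 0.95, "high": 0.80, "medium": 0.50, "low": 0.20, "minimal": 0.10}

def reasoning_budget(max_tokens: int, effort: str) -> int:
    return int(max(min(max_tokens * EFFORT_RATIOS[effort], 32000), 1024))

# Examples: reasoning_budget(16000, "high") -> 12800
#           reasoning_budget(4000, "minimal") -> 1024 (floor applies)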

Parameter Detection

def get_model_params(model_id: str, task_type: str) -> dict:
    """Get optimized parameters for model and task."""
    model_info = model_registry.get(model_id)

    params = {}

    # Enable reasoning for supported models on complex tasks
    if "reasoning" in model_info.supported_parameters:
        if task_type in ["reasoning", "math", "coding"]:
            params["reasoning"] = {
                "effort": "high" if task_type == "reasoning" else "medium"
            }

    # Adjust temperature for task type
    if task_type == "creative":
        params["temperature"] = 0.9
    elif task_type in ["coding", "math"]:
        params["temperature"] = 0.2

    return params

Dynamic Tier Pool Management

Tier Requirements Matrix

| Tier | Latency Budget | Cost Ceiling | Min Models | Required Capabilities |
|------|----------------|--------------|------------|------------------------|
| quick | P95 < 10s | < $0.001/req | 3 | Fast inference |
| balanced | P95 < 45s | < $0.01/req | 3-4 | Good reasoning |
| high | P95 < 120s | < $0.10/req | 4-5 | Full capability |
| reasoning | P95 < 300s | < $1.00/req | 3-4 | Extended thinking |

Dynamic Pool Selection Algorithm

Council Revision: Algorithm updated per council feedback to:

  1. Add Context Window as hard pass/fail constraint
  2. Replace global weights with Tier-Specific Weighting Matrices
  3. Add Anti-Herding logic to prevent traffic concentration

@dataclass
class ModelScore:
    model_id: str
    benchmark_score: float      # Normalized 0-100 (optional, from internal tracker)
    latency_p95: float          # Seconds
    cost_per_request: float     # USD
    availability: float         # 0-1
    diversity_score: float      # Provider diversity
    context_window: int         # Token limit (HARD CONSTRAINT)
    recent_traffic: float       # 0-1, for anti-herding

# COUNCIL RECOMMENDATION: Tier-Specific Weighting Matrices
# Replaces "magic number" global weights (0.4/0.2/0.2/0.1/0.1)
TIER_WEIGHTS = {
    "quick": {
        "latency": 0.45,        # Speed is primary concern
        "cost": 0.25,           # Budget-conscious
        "quality": 0.15,        # Acceptable quality
        "availability": 0.10,
        "diversity": 0.05,
    },
    "balanced": {
        "quality": 0.35,        # Better quality
        "latency": 0.25,        # Still matters
        "cost": 0.20,           # Cost-aware
        "availability": 0.10,
        "diversity": 0.10,
    },
    "high": {
        "quality": 0.50,        # Quality is paramount
        "availability": 0.20,   # Must be reliable
        "latency": 0.15,        # Acceptable wait
        "diversity": 0.10,      # Multiple perspectives
        "cost": 0.05,           # Cost secondary
    },
    "reasoning": {
        "quality": 0.60,        # Best possible quality
        "availability": 0.20,   # Critical reliability
        "diversity": 0.10,      # Diverse reasoning
        "latency": 0.05,        # Patience for quality
        "cost": 0.05,           # Cost not a factor
    },
}

def select_tier_models(
    tier: str,
    task_domain: Optional[str] = None,
    count: int = 4,
    required_context: Optional[int] = None,  # NEW: context requirement
) -> List[str]:
    """Select optimal models for tier using multi-criteria scoring.

    Council-Validated Algorithm:
    1. Apply HARD CONSTRAINTS (pass/fail)
    2. Score using TIER-SPECIFIC weights
    3. Apply ANTI-HERDING penalty
    4. Ensure PROVIDER DIVERSITY
    """

    candidates = model_registry.get_available_models()
    tier_config = TIER_REQUIREMENTS[tier]
    weights = TIER_WEIGHTS[tier]

    # ===== HARD CONSTRAINTS (Pass/Fail) =====
    # Council Critical: Context window MUST be hard constraint, not weighted
    eligible = [
        m for m in candidates
        if m.latency_p95 <= tier_config.latency_budget
        and m.cost_per_request <= tier_config.cost_ceiling
        and m.availability >= 0.95
        # COUNCIL ADDITION: Context window as hard constraint
        and (required_context is None or m.context_window >= required_context)
    ]

    if not eligible:
        logger.warning(f"No models meet hard constraints for tier={tier}")
        return fallback_to_static_config(tier)

    # ===== SOFT SCORING (Tier-Specific Weights) =====
    scored = []
    for model in eligible:
        # Normalize scores to 0-1 range
        latency_score = 1 - (model.latency_p95 / tier_config.latency_budget)
        cost_score = 1 - (model.cost_per_request / tier_config.cost_ceiling)
        quality_score = model.benchmark_score / 100 if model.benchmark_score else 0.5

        score = (
            quality_score * weights["quality"] +
            latency_score * weights["latency"] +
            cost_score * weights["cost"] +
            model.availability * weights["availability"] +
            model.diversity_score * weights["diversity"]
        )

        # Domain boost (task-specific enhancement)
        if task_domain and task_domain in model.strengths:
            score *= 1.15

        # COUNCIL ADDITION: Anti-Herding Penalty
        # Prevent traffic concentration on popular models
        if model.recent_traffic > 0.3:  # More than 30% of recent traffic
            score *= (1 - (model.recent_traffic - 0.3) * 0.5)  # Up to 35% penalty

        scored.append((model.model_id, score))

    # ===== DIVERSITY ENFORCEMENT =====
    selected = select_with_diversity(scored, count, min_providers=2)

    return selected

Benchmark Score Normalization (DEFERRED - Phase 4)

Council Warning: This section describes external benchmark integration which is DEFERRED to Phase 4. Use Internal Performance Tracker (Phase 3) for quality scoring in initial releases.

# DEFERRED: Only implement after Internal Performance Tracker validates value
def normalize_benchmark_scores(model_id: str) -> Optional[float]:
    """Aggregate benchmark scores into single quality metric.

    WARNING: External benchmark scraping is high-maintenance.
    Prefer Internal Performance Tracker for quality scoring.
    Only implement if internal metrics prove insufficient.
    """

    # Start with manual JSON snapshots, NOT automated scrapers
    scores = load_manual_benchmark_snapshot(model_id)

    if not scores:
        return None  # Fall back to internal metrics

    # Weighted aggregation (emphasize reasoning and coding)
    weights = {
        "lmarena_elo": 0.3,     # Human preference
        "livebench": 0.2,       # Contamination-free
        "gpqa_diamond": 0.25,   # Science reasoning
        "swe_bench": 0.25,      # Coding capability
    }

    normalized = sum(
        normalize_to_100(scores[k]) * weights[k]
        for k in weights
        if scores.get(k) is not None
    )

    return normalized

Integration Points

1. Layer 1 Enhancement (ADR-022)

# tier_contract.py modification
def create_tier_contract(tier: str, task_domain: Optional[str] = None) -> TierContract:
    """Create tier contract with dynamically selected models."""

    # Use Model Intelligence Layer instead of static config
    models = model_intelligence.select_tier_models(
        tier=tier,
        task_domain=task_domain,
        count=TIER_MODEL_COUNTS[tier],
    )

    # Get tier-appropriate aggregator
    aggregator = model_intelligence.get_aggregator_for_tier(tier)

    return TierContract(
        tier=tier,
        allowed_models=models,
        aggregator_model=aggregator,
        **get_tier_timeout(tier),
    )

2. Layer 2 Enhancement (ADR-020)

# not_diamond.py modification
async def route_with_intelligence(
    query: str,
    tier_contract: TierContract,
) -> RouteResult:
    """Route using Not Diamond + Model Intelligence."""

    # Get task-appropriate candidates from intelligence layer
    candidates = model_intelligence.select_tier_models(
        tier=tier_contract.tier,
        task_domain=classify_domain(query),
    )

    # Get optimized parameters for each candidate
    params = {
        model: model_intelligence.get_model_params(model, query)
        for model in candidates
    }

    # Route using Not Diamond (with enriched candidates)
    if is_not_diamond_available():
        result = await not_diamond.route(query, candidates)
        return RouteResult(
            model=result.model,
            params=params[result.model],
            confidence=result.confidence,
        )

    # Fallback to intelligence-based selection
    return RouteResult(
        model=candidates[0],
        params=params[candidates[0]],
        confidence=0.7,
    )

3. Gateway Enhancement (ADR-023)

# gateway/types.py modification
@dataclass
class GatewayRequest:
    model: str
    messages: List[CanonicalMessage]
    # New: Model-specific parameters from intelligence layer
    model_params: Optional[Dict[str, Any]] = None

    def apply_model_params(self) -> Dict[str, Any]:
        """Apply optimized parameters to request."""
        request = self.to_openai_format()
        if self.model_params:
            request.update(self.model_params)
        return request

Caching and Refresh Strategy

Cache Layers

| Data | Cache TTL | Refresh Trigger |
|------|-----------|-----------------|
| Model registry | 1 hour | API call / manual |
| Benchmark scores | 24 hours | Daily cron |
| Availability status | 5 minutes | Health check failures |
| Latency metrics | 15 minutes | Rolling window |

Implementation

import httpx
from cachetools import TTLCache

class ModelIntelligenceCache:
    def __init__(self):
        self.registry_cache = TTLCache(maxsize=500, ttl=3600)
        self.benchmark_cache = TTLCache(maxsize=100, ttl=86400)
        self.availability_cache = TTLCache(maxsize=500, ttl=300)

    async def refresh_registry(self):
        """Fetch latest model data from OpenRouter."""
        async with httpx.AsyncClient() as client:
            response = await client.get(
                "https://openrouter.ai/api/v1/models",
                headers={"Authorization": f"Bearer {OPENROUTER_API_KEY}"},
            )
            models = response.json()["data"]

        for model in models:
            self.registry_cache[model["id"]] = ModelInfo.from_api(model)

    async def refresh_benchmarks(self):
        """Fetch latest benchmark data from leaderboards."""
        # LMArena Elo
        lmarena = await fetch_lmarena_leaderboard()
        # LiveBench
        livebench = await fetch_livebench_scores()
        # Artificial Analysis
        aa = await fetch_artificial_analysis()

        # Merge and normalize
        for model_id in self.registry_cache:
            self.benchmark_cache[model_id] = BenchmarkData(
                lmarena_elo=lmarena.get(model_id),
                livebench=livebench.get(model_id),
                artificial_analysis=aa.get(model_id),
            )

Configuration

Environment Variables

# Model Intelligence Layer
LLM_COUNCIL_MODEL_INTELLIGENCE=true|false # Enable dynamic selection
LLM_COUNCIL_BENCHMARK_SOURCE=lmarena|livebench|artificial_analysis|aggregate
LLM_COUNCIL_REFRESH_INTERVAL=3600 # Registry refresh interval (seconds)

# Fallback to static config if intelligence unavailable
LLM_COUNCIL_STATIC_FALLBACK=true|false

# Minimum benchmark score thresholds
LLM_COUNCIL_MIN_BENCHMARK_SCORE=60 # 0-100 normalized
LLM_COUNCIL_MIN_AVAILABILITY=0.95 # 0-1

# Provider diversity
LLM_COUNCIL_MIN_PROVIDERS=2 # Minimum distinct providers per tier
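A sketch of how these variables might be gathered into a single settings object at startup; the function name and defaults here are illustrative, not the actual unified_config.py implementation:

# Illustrative settings loader for the environment variables above.
import os

def _env_bool(name: str, default: str = "false") -> bool:
    return os.getenv(name, default).lower() in {"1", "true", "yes"}

def load_model_intelligence_settings() -> dict:
    return {
        "enabled": _env_bool("LLM_COUNCIL_MODEL_INTELLIGENCE"),
        "benchmark_source": os.getenv("LLM_COUNCIL_BENCHMARK_SOURCE", "aggregate"),
        "refresh_interval": int(os.getenv("LLM_COUNCIL_REFRESH_INTERVAL", "3600")),
        "static_fallback": _env_bool("LLM_COUNCIL_STATIC_FALLBACK", "true"),
        "min_benchmark_score": float(os.getenv("LLM_COUNCIL_MIN_BENCHMARK_SCORE", "60")),
        "min_availability": float(os.getenv("LLM_COUNCIL_MIN_AVAILABILITY", "0.95")),
        "min_providers": int(os.getenv("LLM_COUNCIL_MIN_PROVIDERS", "2")),
    }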

YAML Configuration

Council Revision: Updated to use tier-specific weights instead of global weights.

council:
  model_intelligence:
    enabled: true
    sources:
      openrouter_api: true
      # DEFERRED: External benchmark sources (Phase 4)
      # lmarena: false
      # livebench: false
      # artificial_analysis: false
      internal_performance: true   # Phase 3: Use council session outcomes

    refresh:
      registry_ttl: 3600
      # benchmark_ttl: 86400       # DEFERRED
      availability_ttl: 300
      performance_ttl: 3600        # Internal performance cache

    selection:
      # COUNCIL REVISION: Tier-specific weights instead of global weights
      tier_weights:
        quick:
          latency: 0.45
          cost: 0.25
          quality: 0.15
          availability: 0.10
          diversity: 0.05
        balanced:
          quality: 0.35
          latency: 0.25
          cost: 0.20
          availability: 0.10
          diversity: 0.10
        high:
          quality: 0.50
          availability: 0.20
          latency: 0.15
          diversity: 0.10
          cost: 0.05
        reasoning:
          quality: 0.60
          availability: 0.20
          diversity: 0.10
          latency: 0.05
          cost: 0.05

      constraints:
        min_providers: 2
        min_availability: 0.95
        max_cost_multiplier: 10    # vs cheapest option

      # COUNCIL ADDITION: Anti-Herding
      anti_herding:
        enabled: true
        traffic_threshold: 0.3     # 30% of recent traffic
        max_penalty: 0.35          # Up to 35% score reduction

    parameters:
      auto_reasoning: true         # Enable reasoning params when appropriate
      reasoning_effort_by_tier:
        quick: minimal
        balanced: low
        high: medium
        reasoning: high

    # COUNCIL ADDITION: Internal Performance Tracker
    performance_tracker:
      enabled: true
      store_path: "${HOME}/.llm-council/performance.jsonl"
      decay_days: 30
      min_samples_preliminary: 10
      min_samples_moderate: 30
      min_samples_high: 100

Risks and Mitigations

Council-Identified Risks (High Priority)

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Benchmark scraper breakage | HIGH | HIGH | DEFER to Phase 4; use manual snapshots, not scrapers |
| Traffic herding | Medium | High | Anti-Herding penalty in selection algorithm |
| Context window violations | Medium | High | Hard constraint filter (not weighted) |
| Magic number weights | N/A | Medium | Tier-specific weight matrices |

Original Risks (Updated)

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| External API unavailability | Medium | High | Static fallback, aggressive caching |
| Benchmark data staleness | Medium | Medium | DEFERRED: Internal Performance Tracker instead |
| Model identifier changes | High | Medium | Fuzzy matching, alias tracking |
| Over-optimization | Medium | Medium | Diversity constraints, Anti-Herding logic |
| Cold start latency | Low | Medium | Pre-warm cache on startup |
| Provider bias in benchmarks | Medium | Low | DEFERRED: Internal metrics not susceptible |
| Internal metric bias | Medium | Medium | Minimum sample size requirements, decay weighting |

Success Metrics

Phase 1 Success Metrics (Model Metadata Layer)

| Metric | Target | Measurement |
|--------|--------|-------------|
| Registry availability | > 99% uptime | Track OpenRouter API failures |
| Context window violations | 0 errors | Monitor "context exceeded" errors |
| Static fallback activation | < 1% of requests | Track fallback usage |
| Model freshness | < 1 hour stale | Track registry refresh success |

Phase 2 Success Metrics (Reasoning Parameters)

| Metric | Target | Measurement |
|--------|--------|-------------|
| Parameter utilization | 100% for reasoning tier | Track reasoning param usage |
| Budget token efficiency | > 80% utilization | Compare budget vs actual tokens |
| Reasoning quality | No regression | Compare rubric scores before/after |

Phase 3 Success Metrics (Internal Performance Tracker)

| Metric | Target | Measurement |
|--------|--------|-------------|
| Session coverage | > 95% tracked | Count sessions with metrics |
| Internal metric correlation | > 0.6 with Borda | Validate internal scores vs outcomes |
| Model ranking stability | < 10% weekly variance | Track rank position changes |
| Selection improvement | > 5% higher Borda | Compare dynamic vs static selection |

Overall Success Metrics

| Metric | Target | Measurement |
|--------|--------|-------------|
| Benchmark correlation | > 0.8 | DEFERRED: Internal metrics instead |
| Cost optimization | -15% vs static | Compare equivalent quality |
| Tier pool diversity | ≥ 2 providers | Track provider distribution |
| Anti-Herding effectiveness | No model > 40% traffic | Monitor traffic distribution |

Implementation Phases

Council Recommendation: Decouple proven value (metadata) from speculative value (benchmark intelligence). Implement in strict phases with validation gates.

Phase 1: Model Metadata Layer (v0.15.x) ✅ IMPLEMENTED

Goal: Dynamic model discovery and capability detection via OpenRouter API.

Status: ✅ COMPLETE (2025-12-23)
GitHub Issues: #93, #94, #95
Tests: 79 TDD tests (cache: 20, client: 20, provider: 24, selection: 35)

  • Implement OpenRouter API client (src/llm_council/metadata/openrouter_client.py)
  • Cache model metadata with TTL (1 hour registry, 5 min availability)
    • src/llm_council/metadata/cache.py: TTLCache, ModelIntelligenceCache
  • Add model capability detection (context window, reasoning support, modalities)
    • src/llm_council/metadata/dynamic_provider.py: DynamicMetadataProvider
  • Add Context Window as hard constraint in tier filtering
    • src/llm_council/metadata/selection.py: _meets_context_requirement()
  • Update get_tier_models() to use registry with static fallback
    • src/llm_council/metadata/selection.py: select_tier_models()
  • Implement Anti-Herding logic with traffic tracking
    • src/llm_council/metadata/selection.py: apply_anti_herding_penalty()
  • Add ModelIntelligenceConfig to unified_config.py
  • Add task_domain parameter to tier_contract.py

Environment Variables:

  • LLM_COUNCIL_MODEL_INTELLIGENCE=true enables dynamic selection
  • LLM_COUNCIL_OFFLINE=true forces static provider (takes precedence)

Validation Gate: ✅ PASSED

  • OpenRouter API client with timeout/error handling
  • Static fallback activates when API unavailable or offline mode enabled
  • All 1206 tests pass

Phase 1 "Hollow" Fix (2025-12-24):

Initial Phase 1 implementation used regex pattern matching ("hollow" implementation). Fixed to use real metadata from providers (Issues #105-#108).

| Function | Before | After |
|----------|--------|-------|
| _get_provider_safe() | N/A | Returns provider or None gracefully |
| _get_quality_score_from_metadata() | Regex patterns | Real QualityTier lookup |
| _get_cost_score_from_metadata() | Regex patterns | Real pricing data |
| _meets_context_requirement() | Always True | Real context window filtering |

Quality Tier Scores:

  • FRONTIER: 0.95
  • STANDARD: 0.75
  • ECONOMY: 0.55
  • LOCAL: 0.40

Graceful Degradation: When metadata unavailable, falls back to heuristic estimates.
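A minimal sketch of the tier-to-score mapping and the heuristic fallback described above; the function name and the 0.5 neutral estimate are illustrative (the shipped logic lives in _get_quality_score_from_metadata()):

# Illustrative mapping from QualityTier to the scores listed above.
QUALITY_TIER_SCORES = {
    "frontier": 0.95,
    "standard": 0.75,
    "economy": 0.55,
    "local": 0.40,
}

def quality_score_from_metadata(model_info) -> float:
    """Tier-based score, with a neutral heuristic when metadata is unavailable."""
    tier = getattr(model_info, "quality_tier", None) if model_info else None
    if tier is None:
        return 0.5  # graceful degradation: assumed neutral heuristic estimate
    return QUALITY_TIER_SCORES.get(tier, 0.5)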

Phase 2: Reasoning Parameter Optimization (v0.16.x) ✅ IMPLEMENTED

Goal: Automatic reasoning parameter configuration for capable models.

  • Detect reasoning-capable models from registry metadata
  • Apply reasoning_effort parameter based on tier (quick=minimal, reasoning=high)
  • Calculate budget_tokens per effort level
  • Add task-specific parameter profiles (math→high effort, creative→minimal)
  • Update gateway to pass reasoning parameters to OpenRouter
  • Track reasoning token usage for cost optimization

Implementation Details (2025-12-24):

Implemented using TDD with 80 new tests (1299 total tests pass).

Module Structure: src/llm_council/reasoning/

| File | Purpose |
|------|---------|
| types.py | ReasoningEffort enum, ReasoningConfig frozen dataclass, should_apply_reasoning() |
| tracker.py | ReasoningUsage, AggregatedUsage, extract_reasoning_usage(), aggregate_reasoning_usage() |
| __init__.py | Module exports |

Tier-Effort Mapping:

  • quick → MINIMAL (10%)
  • balanced → LOW (20%)
  • high → MEDIUM (50%)
  • reasoning → HIGH (80%)

Domain Overrides: math→HIGH, coding→MEDIUM, creative→MINIMAL
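A sketch of how the tier defaults and domain overrides above might be resolved. ReasoningEffort is named in the module table, but its members and this helper are assumptions:

# Illustrative resolution of reasoning effort from tier and domain.
from enum import Enum
from typing import Optional

class ReasoningEffort(Enum):
    MINIMAL = "minimal"
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

TIER_EFFORT = {
    "quick": ReasoningEffort.MINIMAL,
    "balanced": ReasoningEffort.LOW,
    "high": ReasoningEffort.MEDIUM,
    "reasoning": ReasoningEffort.HIGH,
}

DOMAIN_OVERRIDES = {
    "math": ReasoningEffort.HIGH,
    "coding": ReasoningEffort.MEDIUM,
    "creative": ReasoningEffort.MINIMAL,
}

def resolve_effort(tier: str, domain: Optional[str] = None) -> ReasoningEffort:
    """Domain override wins; otherwise the tier default applies."""
    return DOMAIN_OVERRIDES.get(domain, TIER_EFFORT[tier])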

Stage Configuration:

  • stage1: true (primary responses)
  • stage2: false (peer reviews)
  • stage3: true (synthesis)

GitHub Issues: #97-#100 (all completed)

Validation Gate: ✅ PASSED

  • Reasoning parameters correctly applied for all reasoning-tier queries
  • Token usage tracking shows expected budget allocation
  • No regressions in non-reasoning tiers (1299 tests pass)

Phase 3: Internal Performance Tracking (v0.17.x) ✅ IMPLEMENTED

Council Recommendation: Instead of scraping external benchmarks (high maintenance risk), implement internal performance tracking based on actual council session outcomes.

  • Track model performance per council session:
    • Borda score received (ModelSessionMetric.borda_score)
    • Response latency (ModelSessionMetric.latency_ms)
    • Parse success rate (ModelSessionMetric.parse_success)
    • Reasoning quality (optional reasoning_tokens_used)
  • Build Internal Performance Index from historical sessions
    • InternalPerformanceTracker with rolling window aggregation
    • ModelPerformanceIndex with mean_borda_score, p50/p95_latency, parse_success_rate
  • Use internal metrics for quality scoring (replaces external benchmarks)
    • get_quality_score() returns 0-100 normalized score
    • Cold start: unknown models get neutral score (50)
  • Implement rolling window decay (recent sessions weighted higher)
    • Exponential decay: weight = exp(-days_ago / decay_days)
    • Default decay_days = 30
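A minimal sketch of the decay weighting described above, assuming timezone-aware timestamps on each ModelSessionMetric; the aggregation shown is illustrative rather than the tracker's exact code:

# Decay-weighted mean Borda score: weight = exp(-days_ago / decay_days).
import math
from datetime import datetime, timezone

def decayed_mean_borda(metrics, decay_days: int = 30) -> float:
    """metrics: iterable with .timestamp (aware datetime) and .borda_score."""
    now = datetime.now(timezone.utc)
    weighted_sum, weight_total = 0.0, 0.0
    for m in metrics:
        days_ago = (now - m.timestamp).total_seconds() / 86400
        w = math.exp(-days_ago / decay_days)
        weighted_sum += w * m.borda_score
        weight_total += w
    return weighted_sum / weight_total if weight_total else 0.0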

Implementation Details:

  • src/llm_council/performance/ module (4 files, ~700 lines)
  • 70 TDD tests in tests/test_performance_*.py
  • JSONL storage pattern (follows bias_persistence.py)
  • Configuration via PerformanceTrackerConfig in unified_config.py

Validation Gate: Phase 3 complete when:

  • 100+ sessions tracked with metrics (tracked via confidence_level=HIGH)
  • Internal quality scores correlate with Borda outcomes (by design)
  • Model selection uses quality_score from tracker

Phase 4: External Benchmark Integration (DEFERRED) ⏸️

Council Warning: External benchmark scraping is HIGH-RISK due to:

  • API instability (LMArena, LiveBench change formats frequently)
  • Maintenance burden (scrapers break silently)
  • Data staleness (monthly updates don't reflect rapid model changes)

Deferred until: Internal Performance Tracking validates the value of quality metrics.

If implemented:

  • Start with manual JSON snapshots (not automated scrapers)
  • Implement LMArena Elo as optional quality boost (not required)
  • LiveBench for contamination-free validation only
  • Create benchmark staleness alerts (>30 days = warning)

Internal Performance Tracker

Council Recommendation: Build quality metrics from actual council session outcomes rather than external benchmarks.

Performance Metrics Schema

from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional

@dataclass
class ModelSessionMetric:
    """Performance data from a single council session."""
    session_id: str
    model_id: str
    timestamp: datetime

    # Stage 1 metrics
    response_latency_ms: int
    response_length: int
    parse_success: bool

    # Stage 2 metrics (from peer review)
    borda_score: float                           # 0.0 - N (N = council size)
    normalized_rank: float                       # 0.0 - 1.0 (1.0 = best)
    rubric_scores: Optional[Dict[str, float]]    # If rubric scoring enabled

    # Stage 3 metrics (from chairman selection)
    selected_for_synthesis: bool                 # Was this response referenced?

@dataclass
class ModelPerformanceIndex:
    """Aggregated performance for a model."""
    model_id: str
    sample_size: int
    last_updated: datetime

    # Aggregated metrics
    mean_borda_score: float
    mean_normalized_rank: float
    p50_latency_ms: int
    p95_latency_ms: int
    parse_success_rate: float
    selection_rate: float    # How often selected for synthesis

    # Confidence
    confidence: str          # INSUFFICIENT (<10), PRELIMINARY (10-30), MODERATE (30-100), HIGH (>100)

class InternalPerformanceTracker:
    """Track and aggregate model performance from council sessions."""

    def __init__(self, store_path: Path, decay_days: int = 30):
        self.store_path = store_path
        self.decay_days = decay_days

    def record_session(self, session_metrics: List[ModelSessionMetric]) -> None:
        """Record metrics from a completed council session."""
        # Atomic append to JSONL store
        ...

    def get_model_index(self, model_id: str) -> ModelPerformanceIndex:
        """Get aggregated performance for a model with rolling window."""
        # Apply exponential decay to older sessions
        # Recent sessions weighted higher
        ...

    def get_quality_score(self, model_id: str) -> float:
        """Get normalized quality score (0-100) for model selection."""
        index = self.get_model_index(model_id)
        if index.confidence == "INSUFFICIENT":
            return 50.0  # Default neutral score
        return index.mean_normalized_rank * 100

Integration with Selection Algorithm

def select_tier_models(tier: str, ...) -> List[str]:
    # ... hard constraints ...

    for model in eligible:
        # Use INTERNAL performance tracker instead of external benchmarks
        quality_score = performance_tracker.get_quality_score(model.model_id)
        # ... rest of scoring with tier-specific weights ...

Open Questions (Council Addressed)

Resolved by Council Review

| Question | Council Answer |
|----------|----------------|
| Should benchmark scores override tier selection? | No. Tiers represent user intent (speed vs quality tradeoff). Benchmarks inform selection within tier. |
| How to handle new models with no data? | Default neutral score (50). Use provider metadata only until internal performance data accumulates. |
| Balance between performance and cost? | Tier-specific. Quick tier: yes, select cheaper. Reasoning tier: never compromise on quality. |
| Auto-apply reasoning parameters? | Yes, by tier. Reasoning tier = high effort, quick tier = minimal effort. |
| Handle benchmark gaming? | Use internal metrics. Council session outcomes are harder to game than public benchmarks. |

Remaining Open Questions

  1. What sample size validates Internal Performance Index?

    • Council suggested 100+ sessions for HIGH confidence
    • Is 30+ sessions sufficient for MODERATE confidence?
  2. Should models with LOW internal scores be automatically demoted?

    • Threshold for exclusion from tier pools?
    • Grace period for new models?
  3. How to bootstrap Internal Performance Tracker?

    • Run shadow sessions with all available models?
    • Start with static config and learn incrementally?

Issues Identified in Full Quorum Review

A. Cold Start Problem (Claude, Gemini)

"When a new model appears in OpenRouter, it has zero internal performance data."

Recommended Solutions:

  • Assign temporary "phantom score" equivalent to tier average until 10+ samples
  • Implement Epsilon-Greedy exploration (small % of requests try new models)
  • Minimum sessions required before model enters regular rotation
  • Manual allowlist for high-profile new releases
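A sketch of the Epsilon-Greedy audition idea from the list above; the epsilon and sample thresholds are illustrative:

# Illustrative epsilon-greedy audition for models with little internal data.
import random

EPSILON = 0.05       # fraction of requests reserved for exploration (assumed value)
MIN_SAMPLES = 10     # below this a model is still "auditioning" (per phantom-score idea)

def pick_model(ranked_models: list, sample_counts: dict) -> str:
    """ranked_models is best-first; sample_counts maps model id -> tracked sessions."""
    auditioning = [m for m in ranked_models if sample_counts.get(m, 0) < MIN_SAMPLES]
    if auditioning and random.random() < EPSILON:
        return random.choice(auditioning)   # exploration: try a new model
    return ranked_models[0]                 # exploitation: best known model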

B. Borda Score Normalization (Claude)

"A 5-model session gives max score of 4; an 8-model session gives max of 7."

Solution: Normalize to percentile rank (0.0-1.0) rather than raw Borda counts:

normalized_rank = (council_size - borda_position) / council_size
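A worked example of the formula, assuming borda_position counts from 0 for the top-ranked response (the indexing convention is not pinned down above):

# Worked example of the normalization above (0 = top-ranked response, assumed).
def normalized_rank(council_size: int, borda_position: int) -> float:
    return (council_size - borda_position) / council_size

# 5-model session: best -> (5 - 0) / 5 = 1.0, worst -> (5 - 4) / 5 = 0.2
# 8-model session: best -> (8 - 0) / 8 = 1.0, worst -> (8 - 7) / 8 = 0.125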

C. Parse Success Definition (Claude)

Define parse success as ALL of:

  • Valid JSON returned (if JSON expected)
  • Schema-compliant response
  • Extractable vote/rationale for Stage 2

D. Anti-Herding Edge Case (Gemini)

"If only 2 models pass hard constraints, the system might oscillate wildly."

Solution: Disable Anti-Herding when eligible model count < 3.

E. Degradation Behavior (Claude)

"What happens when ALL eligible models for a tier fall below acceptable thresholds?"

Fallback Chain:

  1. Warn user and proceed with best-available
  2. Escalate to adjacent tier (quick→balanced, balanced→high)
  3. Fall back to static config as last resort
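A minimal sketch of that fallback chain; the escalation map and helper names are illustrative:

# Illustrative degradation chain: warn, escalate to the adjacent tier, then static config.
import logging

logger = logging.getLogger(__name__)

TIER_ESCALATION = {"quick": "balanced", "balanced": "high", "high": "reasoning"}

def select_with_degradation(tier: str, select_fn, static_fallback_fn) -> list:
    models = select_fn(tier)
    if models:
        return models
    logger.warning("No acceptable models for tier=%s; escalating", tier)
    next_tier = TIER_ESCALATION.get(tier)
    if next_tier:
        models = select_fn(next_tier)
        if models:
            return models
    return static_fallback_fn(tier)   # last resort: static config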

References

External Sources

Research