
ADR-026: Dynamic Model Intelligence and Benchmark-Driven Selection

Status: APPROVED (Blocking Conditions Implemented)
Date: 2025-12-23
Decision Makers: Engineering, Architecture Council
Review: 2025-12-23 (Strategic + Technical Reviews)
Layer Assignment: Cross-cutting (L1-L4 integration)
Implementation: 2025-12-23 (Blocking Conditions 1-3)


⚠️ CRITICAL: Strategic Council Review - Vendor Dependency Risk

Verdict: CONDITIONAL APPROVAL

ADR-026 was not approved in its originally submitted form: the council identified critical vendor dependency risks that had to be addressed before implementation. The blocking conditions below have since been implemented.

"We cannot build the core 'brain' of an open-source project on proprietary APIs that we do not control." — Council Consensus

The "Sovereign Orchestrator" Philosophy

The council unanimously adopts this architectural principle:

The open-source version of LLM Council must function as a complete, independent utility. External services (like OpenRouter or Not Diamond) must be treated as PLUGINS, not foundations.

If the internet is disconnected or if an API key is revoked, the software must still boot, run, and perform its core function (orchestrating LLMs), even if quality is degraded.

Blocking Conditions for Approval

| # | Condition | Status | Priority |
|---|-----------|--------|----------|
| 1 | Add ModelMetadataProvider abstraction interface | ✅ COMPLETED | BLOCKING |
| 2 | Implement StaticRegistryProvider (30+ models) | ✅ COMPLETED (31 models) | BLOCKING |
| 3 | Add offline mode (LLM_COUNCIL_OFFLINE=true) | ✅ COMPLETED | BLOCKING |
| 4 | Evaluate LiteLLM as unified abstraction | ✅ COMPLETED (as fallback) | High |
| 5 | Document degraded vs. enhanced feature matrix | 📋 Required | Medium |

Implementation Notes (2025-12-23)

The blocking conditions were implemented using TDD (Test-Driven Development) with 86 passing tests.

Module Structure: src/llm_council/metadata/

| File | Purpose |
|------|---------|
| types.py | ModelInfo frozen dataclass, QualityTier enum, Modality enum |
| protocol.py | MetadataProvider @runtime_checkable Protocol |
| static_registry.py | StaticRegistryProvider with YAML + LiteLLM fallback |
| litellm_adapter.py | Lazy LiteLLM import for metadata extraction |
| offline.py | is_offline_mode() and check_offline_mode_startup() |
| __init__.py | get_provider() singleton factory, module exports |
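A minimal usage sketch of the module entry points named above. The exact signatures of get_provider() and is_offline_mode() are not spelled out in this ADR, so the calls below are illustrative:

# Illustrative usage only; the real signatures live in src/llm_council/metadata/.
from llm_council.metadata import get_provider
from llm_council.metadata.offline import is_offline_mode

provider = get_provider()  # singleton factory; static provider when offline

if is_offline_mode():
    print("Offline mode: bundled registry only, metadata may be stale")

info = provider.get_model_info("anthropic/claude-opus-4.5")
if info is not None:
    print(info.context_window, info.quality_tier)

print(provider.get_context_window("unknown/model"))  # -> 4096 safe default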

Bundled Registry: src/llm_council/models/registry.yaml

31 models from 8 providers:

  • OpenAI (7): gpt-4o, gpt-4o-mini, gpt-5.2-pro, o1, o1-preview, o1-mini, o3-mini
  • Anthropic (5): claude-opus-4.5, claude-3-5-sonnet, claude-3-5-haiku, claude-3-opus, claude-3-sonnet
  • Google (5): gemini-3-pro-preview, gemini-2.5-pro, gemini-2.0-flash, gemini-1.5-pro, gemini-1.5-flash
  • xAI (2): grok-4, grok-4.1-fast
  • DeepSeek (2): deepseek-r1, deepseek-chat
  • Meta (2): llama-3.3-70b, llama-3.1-405b
  • Mistral (2): mistral-large-2411, mistral-medium
  • Ollama (6): llama3.2, mistral, qwen2.5:14b, codellama, phi3, deepseek-r1:8b

LiteLLM Integration: Used as fallback in the priority chain (local registry > LiteLLM > 4096 default). Lazy import prevents startup failures when LiteLLM is not installed.

GitHub Issues: #89-#92 (all completed)

Strategic Decision: Option C+D (Hybrid + Abstraction)

| Feature | OSS (Self-Hosted) | Council Cloud (Commercial) |
|---------|-------------------|----------------------------|
| Model Metadata | Static library (LiteLLM) + manual YAML config | Real-time dynamic sync via OpenRouter |
| Routing | Heuristic rules (latency/cost-based) | Intelligent ML-based (Not Diamond) |
| Integrations | BYOK (Bring Your Own Keys) | Managed fleet (one bill, instant access) |
| Operations | localhost / individual instance | Team governance, analytics, SSO |

Vendor Dependency Analysis

| Service | Current Role | Risk Level | Required Mitigation |
|---------|--------------|------------|---------------------|
| OpenRouter | Metadata API, gateway | HIGH | Static fallback + LiteLLM |
| Not Diamond | Model routing, classification | MEDIUM | Heuristic fallback (exists) |
| Requesty | Alternative gateway | LOW | Already optional |

Affiliate/Reseller Model: NOT VIABLE

"Reliance on affiliate revenue or tight coupling creates Platform Risk. If OpenRouter releases 'OpenRouter Agents,' Council becomes obsolete instantly. Furthermore, council-cloud cannot withstand margin compression." — Council

Decision: Use external services to lower the User's barrier to entry, not as the backbone of the Product's value.


Required Abstraction Architecture

MetadataProvider Interface (MANDATORY)

from pathlib import Path
from typing import Protocol, Optional, Dict, List
from dataclasses import dataclass

@dataclass
class ModelInfo:
    id: str
    context_window: int
    pricing: Dict[str, float]        # {"prompt": 0.01, "completion": 0.03}
    supported_parameters: List[str]
    modalities: List[str]
    quality_tier: str                # "frontier" | "standard" | "economy"

class MetadataProvider(Protocol):
    """Abstract interface for model metadata sources."""

    def get_model_info(self, model_id: str) -> Optional[ModelInfo]: ...
    def get_context_window(self, model_id: str) -> int: ...
    def get_pricing(self, model_id: str) -> Dict[str, float]: ...
    def supports_reasoning(self, model_id: str) -> bool: ...
    def list_available_models(self) -> List[str]: ...

class StaticRegistryProvider(MetadataProvider):
    """Default: Offline-safe provider using bundled registry + LiteLLM."""

    def __init__(self, registry_path: Optional[Path] = None):
        self.registry = self._load_registry(registry_path)
        self.litellm_data = self._load_litellm_model_map()

    def get_context_window(self, model_id: str) -> int:
        # 1. Check local config override
        if model_id in self.registry:
            return self.registry[model_id].context_window
        # 2. Check LiteLLM library
        if model_id in self.litellm_data:
            return self.litellm_data[model_id].context_window
        # 3. Safe default
        return 4096

class DynamicMetadataProvider(MetadataProvider):
    """Optional: Real-time metadata from OpenRouter API."""

    async def refresh(self) -> None:
        """Fetch latest model data - requires API key."""
        ...

Static Registry Schema (MANDATORY)

# models/registry.yaml - Shipped with OSS
version: "1.0"
updated: "2025-12-23"
models:
  - id: "openai/gpt-4o"
    context_window: 128000
    pricing:
      prompt: 0.0025
      completion: 0.01
    supported_parameters: ["temperature", "top_p", "tools"]
    modalities: ["text", "vision"]
    quality_tier: "frontier"

  - id: "anthropic/claude-opus-4.5"
    context_window: 200000
    pricing:
      prompt: 0.015
      completion: 0.075
    supported_parameters: ["temperature", "top_p", "tools", "reasoning"]
    modalities: ["text", "vision"]
    quality_tier: "frontier"

  - id: "ollama/llama3.2"
    provider: "ollama"
    context_window: 128000
    pricing:
      prompt: 0
      completion: 0
    modalities: ["text"]
    quality_tier: "local"

Offline Mode (MANDATORY)

# Force offline operation - MUST work without any external calls
export LLM_COUNCIL_OFFLINE=true

When offline mode is enabled:

  1. Use StaticRegistryProvider exclusively
  2. Disable all external metadata/routing calls
  3. Log INFO message about limited/stale metadata
  4. All core council operations MUST succeed
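A minimal sketch of how the offline switch could gate provider selection, assuming the helper names from the module table above; the bodies are illustrative, not the shipped implementation:

# Illustrative only: offline check and provider selection honoring LLM_COUNCIL_OFFLINE.
import logging
import os

logger = logging.getLogger(__name__)

def is_offline_mode() -> bool:
    """True when LLM_COUNCIL_OFFLINE is set to a truthy value."""
    return os.getenv("LLM_COUNCIL_OFFLINE", "").lower() in {"1", "true", "yes"}

def get_provider():
    """Return a metadata provider, using only the bundled registry when offline."""
    if is_offline_mode():
        logger.info("Offline mode: using StaticRegistryProvider; metadata may be stale.")
        return StaticRegistryProvider()
    return DynamicMetadataProvider()  # external calls allowed; may still fall back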

Technical Council Review Summary

Technical Review (2025-12-23) - Full Quorum

| Model | Verdict | Rank | Response Time |
|-------|---------|------|---------------|
| Claude Opus 4.5 | CONDITIONAL APPROVAL | #1 | 23.4s |
| Gemini 3 Pro | APPROVE | #2 | 31.4s |
| Grok 4 | APPROVE | #3 | 59.6s |
| GPT-4o | APPROVE | #4 | 9.8s |

"The council successfully identified Response C (Claude) as the superior review, noting its crucial detection of mathematical flaws (Borda normalization with variable pool sizes) and logical gaps (Cold Start) missed by other responses."

First Technical Review (2025-12-23, 3/4 models)

Approved Components:

  • Dynamic metadata integration via OpenRouter API (pricing, availability, capability detection)
  • Reasoning parameter optimization (reasoning_effort, budget_tokens)
  • Integration points with existing L1-L4 architecture

Returned for Revision (Now Resolved):

  • Benchmark scraping strategy → Deferred to Phase 4, use Internal Performance Tracker
  • Single scoring algorithm with "magic number" weights → Tier-Specific Weighting Matrices

Key Technical Recommendations

| Recommendation | Status | Priority |
|----------------|--------|----------|
| Add Context Window as hard constraint | ✅ Incorporated | Critical |
| Replace single scoring with Tier-Specific Weighting | ✅ Incorporated | High |
| Defer benchmark scraping to optional Phase 4 | ✅ Incorporated | High |
| Add Anti-Herding logic | ✅ Incorporated | Medium |
| Implement Internal Performance Tracker | ✅ Incorporated | Medium |
| Cold Start handling for new models | 📋 Documented | Medium |
| Borda score normalization | 📋 Documented | Medium |
| Anti-Herding edge case (<3 models) | 📋 Documented | Low |

Council Consensus Points

  1. Context Window is a hard pass/fail constraint - must filter before scoring, not weight
  2. Tier-specific weighting is essential - quick tier prioritizes speed, reasoning tier prioritizes quality
  3. Benchmark scraping is high-risk - external APIs change frequently, creates maintenance nightmare
  4. Internal performance data is more valuable - track actual council session outcomes
  5. Phased approach required - decouple metadata (proven value) from benchmark intelligence (speculative)
  6. Cold Start needs exploration strategy - new models need "audition" mechanism (Phase 3)
  7. LiteLLM strongly recommended - use as library for metadata, not just proxy

Context

Problem Statement

The LLM Council's current model selection relies on static configuration that quickly goes stale in a rapidly evolving model landscape. November and December 2025 alone saw major releases from every frontier lab:

| Release Date | Model | Provider |
|--------------|-------|----------|
| Nov 17, 2025 | Grok 4.1 | xAI |
| Nov 18, 2025 | Gemini 3 Pro | Google |
| Nov 24, 2025 | Claude Opus 4.5 | Anthropic |
| Dec 11, 2025 | GPT-5.2 | OpenAI |

Our tier pools in config.py reference models that may be:

  • Deprecated or renamed (model identifiers change)
  • Outperformed by newer models (benchmarks shift monthly)
  • Suboptimally configured (missing reasoning parameters)
  • Unavailable or rate-limited (provider status changes)

Current Architecture Gaps

| Gap | Impact | Current State |
|-----|--------|---------------|
| Static tier pools | Stale model selection | Hardcoded in config.py |
| No benchmark integration | Suboptimal model-task matching | Manual updates |
| No model metadata | Missing capabilities detection | Assumed uniform |
| No reasoning parameters | Underutilized model capabilities | Default parameters only |
| No availability tracking | Failures on unavailable models | Reactive error handling |

Existing Foundation (ADRs 020, 022, 024)

The architecture already supports dynamic model selection:

| ADR | Component | Opportunity |
|-----|-----------|-------------|
| ADR-020 | Not Diamond integration | Model routing API exists but uses static candidates |
| ADR-022 | Tier contracts | allowed_models field could be dynamically populated |
| ADR-024 | Layer architecture | L1 tier selection could query external data sources |

Decision

Implement a Model Intelligence Layer that provides real-time model metadata, benchmark rankings, and dynamic pool management to all routing layers.

Architecture Overview

┌────────────────────────────────────────────────────────────────────┐
│                   MODEL INTELLIGENCE LAYER (New)                   │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│ ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐    │
│ │ Model Registry  │   │ Benchmark Index │   │  Availability   │    │
│ │ (OpenRouter API)│   │ (Leaderboards)  │   │    Monitor      │    │
│ └────────┬────────┘   └────────┬────────┘   └────────┬────────┘    │
│          │                     │                     │             │
│          └─────────────────────┴─────────────────────┘             │
│                                │                                   │
│                    ┌───────────▼───────────┐                       │
│                    │  Model Selector API   │                       │
│                    │  - get_tier_models()  │                       │
│                    │  - get_best_for_task()│                       │
│                    │  - get_model_params() │                       │
│                    └───────────┬───────────┘                       │
│                                │                                   │
└────────────────────────────────┼───────────────────────────────────┘
                                 │
         ┌───────────────────────┼───────────────────────┐
         │                       │                       │
         ▼                       ▼                       ▼
 ┌───────────────┐       ┌───────────────┐       ┌───────────────┐
 │  L1: Tier     │       │  L2: Query    │       │  L4: Gateway  │
 │  Selection    │       │  Triage       │       │  Routing      │
 │  (ADR-022)    │       │  (ADR-020)    │       │  (ADR-023)    │
 └───────────────┘       └───────────────┘       └───────────────┘

Data Sources

1. OpenRouter Models API

Endpoint: GET https://openrouter.ai/api/v1/models

Provides real-time model metadata:

{
  "id": "anthropic/claude-opus-4-5-20250514",
  "name": "Claude Opus 4.5",
  "pricing": {
    "prompt": "0.000015",
    "completion": "0.000075"
  },
  "context_length": 200000,
  "architecture": {
    "input_modalities": ["text", "image"],
    "output_modalities": ["text"]
  },
  "supported_parameters": ["temperature", "top_p", "reasoning"],
  "top_provider": {
    "is_moderated": true
  }
}

Key Fields for Selection:

  • pricing - Cost optimization
  • context_length - Long document handling
  • supported_parameters - Reasoning mode detection
  • input_modalities - Multimodal capability
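For illustration, the payload above maps naturally onto the ModelInfo dataclass defined earlier; the ModelInfo.from_api() helper used in the caching code later in this document presumably does something similar. The function below is a sketch, and its tier heuristic is a placeholder:

# Illustrative mapping from the OpenRouter /models payload to ModelInfo (defined earlier).
def model_info_from_api(payload: dict) -> ModelInfo:
    pricing = payload.get("pricing", {})
    arch = payload.get("architecture", {})
    return ModelInfo(
        id=payload["id"],
        context_window=payload.get("context_length", 4096),
        pricing={
            "prompt": float(pricing.get("prompt", 0)),
            "completion": float(pricing.get("completion", 0)),
        },
        supported_parameters=payload.get("supported_parameters", []),
        modalities=arch.get("input_modalities", ["text"]),
        # Placeholder heuristic; the real tier assignment is curated in registry.yaml
        quality_tier="frontier" if payload.get("context_length", 0) >= 200_000 else "standard",
    )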

2. Benchmark Leaderboards

| Source | Data | Update Frequency | API |
|--------|------|------------------|-----|
| LMArena | Elo ratings from 5M+ votes | Real-time | Public |
| LiveBench | Monthly contamination-free benchmarks | Monthly | Public |
| Artificial Analysis | Speed, cost, quality metrics | Weekly | Public |
| LLM Stats | Aggregated performance data | Daily | Public |

Benchmark Categories:

  • Reasoning: GPQA Diamond, AIME 2025, ARC-AGI-2
  • Coding: SWE-bench, LiveCodeBench, Terminal-Bench
  • General: MMLU-Pro, Humanity's Last Exam
  • Speed: Tokens/second, time-to-first-token

3. OpenRouter Rankings

Endpoint: GET https://openrouter.ai/rankings

Usage-based popularity metrics (tokens served, request count).


Model Parameter Optimization

Reasoning Mode Parameters

OpenRouter supports unified reasoning parameters:

# For reasoning-capable models (o1, o3, GPT-5, Claude with thinking)
request_params = {
    "reasoning": {
        "effort": "high",       # "minimal" | "low" | "medium" | "high" | "xhigh"
        "max_tokens": 32000,    # Budget for reasoning tokens
        "exclude": False,       # Include reasoning in response
    }
}

Effort Level Budget Calculation:

budget_tokens = max(min(max_tokens * effort_ratio, 32000), 1024)

effort_ratio:
  xhigh:   0.95
  high:    0.80
  medium:  0.50
  low:     0.20
  minimal: 0.10
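The same calculation in Python, purely to make the clamping bounds explicit (the ratio table mirrors the values above):

# Budget per the formula above, clamped to [1024, 32000] reasoning tokens.
EFFORT_RATIOS = {"xhigh": 0.95, "high": 0.80, "medium": 0.50, "low": 0.20, "minimal": 0.10}

def reasoning_budget(max_tokens: int, effort: str) -> int:
    return int(max(min(max_tokens * EFFORT_RATIOS[effort], 32000), 1024))

# Examples: reasoning_budget(16000, "high") -> 12800
#           reasoning_budget(4000, "minimal") -> 1024 (floor applies)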

Parameter Detection

def get_model_params(model_id: str, task_type: str) -> dict:
    """Get optimized parameters for model and task."""
    model_info = model_registry.get(model_id)

    params = {}

    # Enable reasoning for supported models on complex tasks
    if "reasoning" in model_info.supported_parameters:
        if task_type in ["reasoning", "math", "coding"]:
            params["reasoning"] = {
                "effort": "high" if task_type == "reasoning" else "medium"
            }

    # Adjust temperature for task type
    if task_type == "creative":
        params["temperature"] = 0.9
    elif task_type in ["coding", "math"]:
        params["temperature"] = 0.2

    return params

Dynamic Tier Pool Management

Tier Requirements Matrix

| Tier | Latency Budget | Cost Ceiling | Min Models | Required Capabilities |
|------|----------------|--------------|------------|------------------------|
| quick | P95 < 10s | < $0.001/req | 3 | Fast inference |
| balanced | P95 < 45s | < $0.01/req | 3-4 | Good reasoning |
| high | P95 < 120s | < $0.10/req | 4-5 | Full capability |
| reasoning | P95 < 300s | < $1.00/req | 3-4 | Extended thinking |

Dynamic Pool Selection Algorithm

Council Revision: Algorithm updated per council feedback to:

  1. Add Context Window as hard pass/fail constraint
  2. Replace global weights with Tier-Specific Weighting Matrices
  3. Add Anti-Herding logic to prevent traffic concentration

@dataclass
class ModelScore:
    model_id: str
    benchmark_score: float      # Normalized 0-100 (optional, from internal tracker)
    latency_p95: float          # Seconds
    cost_per_request: float     # USD
    availability: float         # 0-1
    diversity_score: float      # Provider diversity
    context_window: int         # Token limit (HARD CONSTRAINT)
    recent_traffic: float       # 0-1, for anti-herding

# COUNCIL RECOMMENDATION: Tier-Specific Weighting Matrices
# Replaces "magic number" global weights (0.4/0.2/0.2/0.1/0.1)
TIER_WEIGHTS = {
    "quick": {
        "latency": 0.45,        # Speed is primary concern
        "cost": 0.25,           # Budget-conscious
        "quality": 0.15,        # Acceptable quality
        "availability": 0.10,
        "diversity": 0.05,
    },
    "balanced": {
        "quality": 0.35,        # Better quality
        "latency": 0.25,        # Still matters
        "cost": 0.20,           # Cost-aware
        "availability": 0.10,
        "diversity": 0.10,
    },
    "high": {
        "quality": 0.50,        # Quality is paramount
        "availability": 0.20,   # Must be reliable
        "latency": 0.15,        # Acceptable wait
        "diversity": 0.10,      # Multiple perspectives
        "cost": 0.05,           # Cost secondary
    },
    "reasoning": {
        "quality": 0.60,        # Best possible quality
        "availability": 0.20,   # Critical reliability
        "diversity": 0.10,      # Diverse reasoning
        "latency": 0.05,        # Patience for quality
        "cost": 0.05,           # Cost not a factor
    },
}

def select_tier_models(
    tier: str,
    task_domain: Optional[str] = None,
    count: int = 4,
    required_context: Optional[int] = None,  # NEW: context requirement
) -> List[str]:
    """Select optimal models for tier using multi-criteria scoring.

    Council-Validated Algorithm:
    1. Apply HARD CONSTRAINTS (pass/fail)
    2. Score using TIER-SPECIFIC weights
    3. Apply ANTI-HERDING penalty
    4. Ensure PROVIDER DIVERSITY
    """

    candidates = model_registry.get_available_models()
    tier_config = TIER_REQUIREMENTS[tier]
    weights = TIER_WEIGHTS[tier]

    # ===== HARD CONSTRAINTS (Pass/Fail) =====
    # Council Critical: Context window MUST be hard constraint, not weighted
    eligible = [
        m for m in candidates
        if m.latency_p95 <= tier_config.latency_budget
        and m.cost_per_request <= tier_config.cost_ceiling
        and m.availability >= 0.95
        # COUNCIL ADDITION: Context window as hard constraint
        and (required_context is None or m.context_window >= required_context)
    ]

    if not eligible:
        logger.warning(f"No models meet hard constraints for tier={tier}")
        return fallback_to_static_config(tier)

    # ===== SOFT SCORING (Tier-Specific Weights) =====
    scored = []
    for model in eligible:
        # Normalize scores to 0-1 range
        latency_score = 1 - (model.latency_p95 / tier_config.latency_budget)
        cost_score = 1 - (model.cost_per_request / tier_config.cost_ceiling)
        quality_score = model.benchmark_score / 100 if model.benchmark_score else 0.5

        score = (
            quality_score * weights["quality"] +
            latency_score * weights["latency"] +
            cost_score * weights["cost"] +
            model.availability * weights["availability"] +
            model.diversity_score * weights["diversity"]
        )

        # Domain boost (task-specific enhancement)
        if task_domain and task_domain in model.strengths:
            score *= 1.15

        # COUNCIL ADDITION: Anti-Herding Penalty
        # Prevent traffic concentration on popular models
        if model.recent_traffic > 0.3:  # More than 30% of recent traffic
            score *= (1 - (model.recent_traffic - 0.3) * 0.5)  # Up to 35% penalty

        scored.append((model.model_id, score))

    # ===== DIVERSITY ENFORCEMENT =====
    selected = select_with_diversity(scored, count, min_providers=2)

    return selected

Benchmark Score Normalization (DEFERRED - Phase 4)

Council Warning: This section describes external benchmark integration which is DEFERRED to Phase 4. Use Internal Performance Tracker (Phase 3) for quality scoring in initial releases.

# DEFERRED: Only implement after Internal Performance Tracker validates value
def normalize_benchmark_scores(model_id: str) -> Optional[float]:
    """Aggregate benchmark scores into single quality metric.

    WARNING: External benchmark scraping is high-maintenance.
    Prefer Internal Performance Tracker for quality scoring.
    Only implement if internal metrics prove insufficient.
    """

    # Start with manual JSON snapshots, NOT automated scrapers
    scores = load_manual_benchmark_snapshot(model_id)

    if not scores:
        return None  # Fall back to internal metrics

    # Weighted aggregation (emphasize reasoning and coding)
    weights = {
        "lmarena_elo": 0.3,     # Human preference
        "livebench": 0.2,       # Contamination-free
        "gpqa_diamond": 0.25,   # Science reasoning
        "swe_bench": 0.25,      # Coding capability
    }

    normalized = sum(
        normalize_to_100(scores[k]) * weights[k]
        for k in weights
        if scores.get(k) is not None
    )

    return normalized

Integration Points

1. Layer 1 Enhancement (ADR-022)

# tier_contract.py modification
def create_tier_contract(tier: str, task_domain: Optional[str] = None) -> TierContract:
    """Create tier contract with dynamically selected models."""

    # Use Model Intelligence Layer instead of static config
    models = model_intelligence.select_tier_models(
        tier=tier,
        task_domain=task_domain,
        count=TIER_MODEL_COUNTS[tier],
    )

    # Get tier-appropriate aggregator
    aggregator = model_intelligence.get_aggregator_for_tier(tier)

    return TierContract(
        tier=tier,
        allowed_models=models,
        aggregator_model=aggregator,
        **get_tier_timeout(tier),
    )

2. Layer 2 Enhancement (ADR-020)

# not_diamond.py modification
async def route_with_intelligence(
    query: str,
    tier_contract: TierContract,
) -> RouteResult:
    """Route using Not Diamond + Model Intelligence."""

    # Get task-appropriate candidates from intelligence layer
    candidates = model_intelligence.select_tier_models(
        tier=tier_contract.tier,
        task_domain=classify_domain(query),
    )

    # Get optimized parameters for each candidate
    params = {
        model: model_intelligence.get_model_params(model, query)
        for model in candidates
    }

    # Route using Not Diamond (with enriched candidates)
    if is_not_diamond_available():
        result = await not_diamond.route(query, candidates)
        return RouteResult(
            model=result.model,
            params=params[result.model],
            confidence=result.confidence,
        )

    # Fallback to intelligence-based selection
    return RouteResult(
        model=candidates[0],
        params=params[candidates[0]],
        confidence=0.7,
    )

3. Gateway Enhancement (ADR-023)

# gateway/types.py modification
@dataclass
class GatewayRequest:
    model: str
    messages: List[CanonicalMessage]
    # New: Model-specific parameters from intelligence layer
    model_params: Optional[Dict[str, Any]] = None

    def apply_model_params(self) -> Dict[str, Any]:
        """Apply optimized parameters to request."""
        request = self.to_openai_format()
        if self.model_params:
            request.update(self.model_params)
        return request

Caching and Refresh Strategy

Cache Layers

| Data | Cache TTL | Refresh Trigger |
|------|-----------|-----------------|
| Model registry | 1 hour | API call / manual |
| Benchmark scores | 24 hours | Daily cron |
| Availability status | 5 minutes | Health check failures |
| Latency metrics | 15 minutes | Rolling window |

Implementation

import httpx
from cachetools import TTLCache

class ModelIntelligenceCache:
    def __init__(self):
        self.registry_cache = TTLCache(maxsize=500, ttl=3600)
        self.benchmark_cache = TTLCache(maxsize=100, ttl=86400)
        self.availability_cache = TTLCache(maxsize=500, ttl=300)

    async def refresh_registry(self):
        """Fetch latest model data from OpenRouter."""
        async with httpx.AsyncClient() as client:
            response = await client.get(
                "https://openrouter.ai/api/v1/models",
                headers={"Authorization": f"Bearer {OPENROUTER_API_KEY}"},
            )
            models = response.json()["data"]

        for model in models:
            self.registry_cache[model["id"]] = ModelInfo.from_api(model)

    async def refresh_benchmarks(self):
        """Fetch latest benchmark data from leaderboards."""
        # LMArena Elo
        lmarena = await fetch_lmarena_leaderboard()
        # LiveBench
        livebench = await fetch_livebench_scores()
        # Artificial Analysis
        aa = await fetch_artificial_analysis()

        # Merge and normalize
        for model_id in self.registry_cache:
            self.benchmark_cache[model_id] = BenchmarkData(
                lmarena_elo=lmarena.get(model_id),
                livebench=livebench.get(model_id),
                artificial_analysis=aa.get(model_id),
            )

Configuration

Environment Variables

# Model Intelligence Layer
LLM_COUNCIL_MODEL_INTELLIGENCE=true|false # Enable dynamic selection
LLM_COUNCIL_BENCHMARK_SOURCE=lmarena|livebench|artificial_analysis|aggregate
LLM_COUNCIL_REFRESH_INTERVAL=3600 # Registry refresh interval (seconds)

# Fallback to static config if intelligence unavailable
LLM_COUNCIL_STATIC_FALLBACK=true|false

# Minimum benchmark score thresholds
LLM_COUNCIL_MIN_BENCHMARK_SCORE=60 # 0-100 normalized
LLM_COUNCIL_MIN_AVAILABILITY=0.95 # 0-1

# Provider diversity
LLM_COUNCIL_MIN_PROVIDERS=2 # Minimum distinct providers per tier
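A sketch of how these variables might be gathered into a single settings object at startup; the function name and defaults here are illustrative, not the actual unified_config.py implementation:

# Illustrative settings loader for the environment variables above.
import os

def _env_bool(name: str, default: str = "false") -> bool:
    return os.getenv(name, default).lower() in {"1", "true", "yes"}

def load_model_intelligence_settings() -> dict:
    return {
        "enabled": _env_bool("LLM_COUNCIL_MODEL_INTELLIGENCE"),
        "benchmark_source": os.getenv("LLM_COUNCIL_BENCHMARK_SOURCE", "aggregate"),
        "refresh_interval": int(os.getenv("LLM_COUNCIL_REFRESH_INTERVAL", "3600")),
        "static_fallback": _env_bool("LLM_COUNCIL_STATIC_FALLBACK", "true"),
        "min_benchmark_score": float(os.getenv("LLM_COUNCIL_MIN_BENCHMARK_SCORE", "60")),
        "min_availability": float(os.getenv("LLM_COUNCIL_MIN_AVAILABILITY", "0.95")),
        "min_providers": int(os.getenv("LLM_COUNCIL_MIN_PROVIDERS", "2")),
    }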

YAML Configuration

Council Revision: Updated to use tier-specific weights instead of global weights.

council:
  model_intelligence:
    enabled: true
    sources:
      openrouter_api: true
      # DEFERRED: External benchmark sources (Phase 4)
      # lmarena: false
      # livebench: false
      # artificial_analysis: false
      internal_performance: true   # Phase 3: Use council session outcomes

    refresh:
      registry_ttl: 3600
      # benchmark_ttl: 86400       # DEFERRED
      availability_ttl: 300
      performance_ttl: 3600        # Internal performance cache

    selection:
      # COUNCIL REVISION: Tier-specific weights instead of global weights
      tier_weights:
        quick:
          latency: 0.45
          cost: 0.25
          quality: 0.15
          availability: 0.10
          diversity: 0.05
        balanced:
          quality: 0.35
          latency: 0.25
          cost: 0.20
          availability: 0.10
          diversity: 0.10
        high:
          quality: 0.50
          availability: 0.20
          latency: 0.15
          diversity: 0.10
          cost: 0.05
        reasoning:
          quality: 0.60
          availability: 0.20
          diversity: 0.10
          latency: 0.05
          cost: 0.05

      constraints:
        min_providers: 2
        min_availability: 0.95
        max_cost_multiplier: 10    # vs cheapest option

      # COUNCIL ADDITION: Anti-Herding
      anti_herding:
        enabled: true
        traffic_threshold: 0.3     # 30% of recent traffic
        max_penalty: 0.35          # Up to 35% score reduction

    parameters:
      auto_reasoning: true         # Enable reasoning params when appropriate
      reasoning_effort_by_tier:
        quick: minimal
        balanced: low
        high: medium
        reasoning: high

    # COUNCIL ADDITION: Internal Performance Tracker
    performance_tracker:
      enabled: true
      store_path: "${HOME}/.llm-council/performance.jsonl"
      decay_days: 30
      min_samples_preliminary: 10
      min_samples_moderate: 30
      min_samples_high: 100

Risks and Mitigations

Council-Identified Risks (High Priority)

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Benchmark scraper breakage | HIGH | HIGH | DEFER to Phase 4; use manual snapshots, not scrapers |
| Traffic herding | Medium | High | Anti-Herding penalty in selection algorithm |
| Context window violations | Medium | High | Hard constraint filter (not weighted) |
| Magic number weights | N/A | Medium | Tier-specific weight matrices |

Original Risks (Updated)

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| External API unavailability | Medium | High | Static fallback, aggressive caching |
| Benchmark data staleness | Medium | Medium | DEFERRED: Internal Performance Tracker instead |
| Model identifier changes | High | Medium | Fuzzy matching, alias tracking |
| Over-optimization | Medium | Medium | Diversity constraints, Anti-Herding logic |
| Cold start latency | Low | Medium | Pre-warm cache on startup |
| Provider bias in benchmarks | Medium | Low | DEFERRED: Internal metrics not susceptible |
| Internal metric bias | Medium | Medium | Minimum sample size requirements, decay weighting |

Success Metrics

Phase 1 Success Metrics (Model Metadata Layer)

| Metric | Target | Measurement |
|--------|--------|-------------|
| Registry availability | > 99% uptime | Track OpenRouter API failures |
| Context window violations | 0 errors | Monitor "context exceeded" errors |
| Static fallback activation | < 1% of requests | Track fallback usage |
| Model freshness | < 1 hour stale | Track registry refresh success |

Phase 2 Success Metrics (Reasoning Parameters)

| Metric | Target | Measurement |
|--------|--------|-------------|
| Parameter utilization | 100% for reasoning tier | Track reasoning param usage |
| Budget token efficiency | > 80% utilization | Compare budget vs actual tokens |
| Reasoning quality | No regression | Compare rubric scores before/after |

Phase 3 Success Metrics (Internal Performance Tracker)

| Metric | Target | Measurement |
|--------|--------|-------------|
| Session coverage | > 95% tracked | Count sessions with metrics |
| Internal metric correlation | > 0.6 with Borda | Validate internal scores vs outcomes |
| Model ranking stability | < 10% weekly variance | Track rank position changes |
| Selection improvement | > 5% higher Borda | Compare dynamic vs static selection |

Overall Success Metrics

| Metric | Target | Measurement |
|--------|--------|-------------|
| Benchmark correlation | > 0.8 | DEFERRED: Internal metrics instead |
| Cost optimization | -15% vs static | Compare equivalent quality |
| Tier pool diversity | ≥ 2 providers | Track provider distribution |
| Anti-Herding effectiveness | No model > 40% traffic | Monitor traffic distribution |

Implementation Phases

Council Recommendation: Decouple proven value (metadata) from speculative value (benchmark intelligence). Implement in strict phases with validation gates.

Phase 1: Model Metadata Layer (v0.15.x) ✅ IMPLEMENTED

Goal: Dynamic model discovery and capability detection via OpenRouter API.

Status: ✅ COMPLETE (2025-12-23)
GitHub Issues: #93, #94, #95
Tests: 79 TDD tests (cache: 20, client: 20, provider: 24, selection: 35)

  • Implement OpenRouter API client (src/llm_council/metadata/openrouter_client.py)
  • Cache model metadata with TTL (1 hour registry, 5 min availability)
    • src/llm_council/metadata/cache.py: TTLCache, ModelIntelligenceCache
  • Add model capability detection (context window, reasoning support, modalities)
    • src/llm_council/metadata/dynamic_provider.py: DynamicMetadataProvider
  • Add Context Window as hard constraint in tier filtering
    • src/llm_council/metadata/selection.py: _meets_context_requirement()
  • Update get_tier_models() to use registry with static fallback
    • src/llm_council/metadata/selection.py: select_tier_models()
  • Implement Anti-Herding logic with traffic tracking
    • src/llm_council/metadata/selection.py: apply_anti_herding_penalty()
  • Add ModelIntelligenceConfig to unified_config.py
  • Add task_domain parameter to tier_contract.py

Environment Variables:

  • LLM_COUNCIL_MODEL_INTELLIGENCE=true enables dynamic selection
  • LLM_COUNCIL_OFFLINE=true forces static provider (takes precedence)

Validation Gate: ✅ PASSED

  • OpenRouter API client with timeout/error handling
  • Static fallback activates when API unavailable or offline mode enabled
  • All 1206 tests pass

Phase 1 "Hollow" Fix (2025-12-24):

Initial Phase 1 implementation used regex pattern matching ("hollow" implementation). Fixed to use real metadata from providers (Issues #105-#108).

| Function | Before | After |
|----------|--------|-------|
| _get_provider_safe() | N/A | Returns provider or None gracefully |
| _get_quality_score_from_metadata() | Regex patterns | Real QualityTier lookup |
| _get_cost_score_from_metadata() | Regex patterns | Real pricing data |
| _meets_context_requirement() | Always True | Real context window filtering |

Quality Tier Scores:

  • FRONTIER: 0.95
  • STANDARD: 0.75
  • ECONOMY: 0.55
  • LOCAL: 0.40

Graceful Degradation: When metadata unavailable, falls back to heuristic estimates.
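A minimal sketch of the tier-to-score mapping and the heuristic fallback described above; the function name and the 0.5 neutral estimate are illustrative (the shipped logic lives in _get_quality_score_from_metadata()):

# Illustrative mapping from QualityTier to the scores listed above.
QUALITY_TIER_SCORES = {
    "frontier": 0.95,
    "standard": 0.75,
    "economy": 0.55,
    "local": 0.40,
}

def quality_score_from_metadata(model_info) -> float:
    """Tier-based score, with a neutral heuristic when metadata is unavailable."""
    tier = getattr(model_info, "quality_tier", None) if model_info else None
    if tier is None:
        return 0.5  # graceful degradation: assumed neutral heuristic estimate
    return QUALITY_TIER_SCORES.get(tier, 0.5)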

Phase 2: Reasoning Parameter Optimization (v0.16.x) ✅ IMPLEMENTED

Goal: Automatic reasoning parameter configuration for capable models.

  • Detect reasoning-capable models from registry metadata
  • Apply reasoning_effort parameter based on tier (quick=minimal, reasoning=high)
  • Calculate budget_tokens per effort level
  • Add task-specific parameter profiles (math→high effort, creative→minimal)
  • Update gateway to pass reasoning parameters to OpenRouter
  • Track reasoning token usage for cost optimization

Implementation Details (2025-12-24):

Implemented using TDD with 80 new tests (1299 total tests pass).

Module Structure: src/llm_council/reasoning/

| File | Purpose |
|------|---------|
| types.py | ReasoningEffort enum, ReasoningConfig frozen dataclass, should_apply_reasoning() |
| tracker.py | ReasoningUsage, AggregatedUsage, extract_reasoning_usage(), aggregate_reasoning_usage() |
| __init__.py | Module exports |

Tier-Effort Mapping:

  • quick → MINIMAL (10%)
  • balanced → LOW (20%)
  • high → MEDIUM (50%)
  • reasoning → HIGH (80%)

Domain Overrides: math→HIGH, coding→MEDIUM, creative→MINIMAL
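A sketch of how the tier defaults and domain overrides above might be resolved. ReasoningEffort is named in the module table, but its members and this helper are assumptions:

# Illustrative resolution of reasoning effort from tier and domain.
from enum import Enum
from typing import Optional

class ReasoningEffort(Enum):
    MINIMAL = "minimal"
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

TIER_EFFORT = {
    "quick": ReasoningEffort.MINIMAL,
    "balanced": ReasoningEffort.LOW,
    "high": ReasoningEffort.MEDIUM,
    "reasoning": ReasoningEffort.HIGH,
}

DOMAIN_OVERRIDES = {
    "math": ReasoningEffort.HIGH,
    "coding": ReasoningEffort.MEDIUM,
    "creative": ReasoningEffort.MINIMAL,
}

def resolve_effort(tier: str, domain: Optional[str] = None) -> ReasoningEffort:
    """Domain override wins; otherwise the tier default applies."""
    return DOMAIN_OVERRIDES.get(domain, TIER_EFFORT[tier])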

Stage Configuration:

  • stage1: true (primary responses)
  • stage2: false (peer reviews)
  • stage3: true (synthesis)

GitHub Issues: #97-#100 (all completed)

Validation Gate: ✅ PASSED

  • Reasoning parameters correctly applied for all reasoning-tier queries
  • Token usage tracking shows expected budget allocation
  • No regressions in non-reasoning tiers (1299 tests pass)

Phase 3: Internal Performance Tracking (v0.17.x) ✅ IMPLEMENTED

Council Recommendation: Instead of scraping external benchmarks (high maintenance risk), implement internal performance tracking based on actual council session outcomes.

  • Track model performance per council session:
    • Borda score received (ModelSessionMetric.borda_score)
    • Response latency (ModelSessionMetric.latency_ms)
    • Parse success rate (ModelSessionMetric.parse_success)
    • Reasoning quality (optional reasoning_tokens_used)
  • Build Internal Performance Index from historical sessions
    • InternalPerformanceTracker with rolling window aggregation
    • ModelPerformanceIndex with mean_borda_score, p50/p95_latency, parse_success_rate
  • Use internal metrics for quality scoring (replaces external benchmarks)
    • get_quality_score() returns 0-100 normalized score
    • Cold start: unknown models get neutral score (50)
  • Implement rolling window decay (recent sessions weighted higher)
    • Exponential decay: weight = exp(-days_ago / decay_days)
    • Default decay_days = 30
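A minimal sketch of the decay weighting described above, assuming timezone-aware timestamps on each ModelSessionMetric; the aggregation shown is illustrative rather than the tracker's exact code:

# Decay-weighted mean Borda score: weight = exp(-days_ago / decay_days).
import math
from datetime import datetime, timezone

def decayed_mean_borda(metrics, decay_days: int = 30) -> float:
    """metrics: iterable with .timestamp (aware datetime) and .borda_score."""
    now = datetime.now(timezone.utc)
    weighted_sum, weight_total = 0.0, 0.0
    for m in metrics:
        days_ago = (now - m.timestamp).total_seconds() / 86400
        w = math.exp(-days_ago / decay_days)
        weighted_sum += w * m.borda_score
        weight_total += w
    return weighted_sum / weight_total if weight_total else 0.0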

Implementation Details:

  • src/llm_council/performance/ module (4 files, ~700 lines)
  • 70 TDD tests in tests/test_performance_*.py
  • JSONL storage pattern (follows bias_persistence.py)
  • Configuration via PerformanceTrackerConfig in unified_config.py

Validation Gate: Phase 3 complete when:

  • 100+ sessions tracked with metrics (tracked via confidence_level=HIGH)
  • Internal quality scores correlate with Borda outcomes (by design)
  • Model selection uses quality_score from tracker

Phase 4: External Benchmark Integration (DEFERRED) ⏸️

Council Warning: External benchmark scraping is HIGH-RISK due to:

  • API instability (LMArena, LiveBench change formats frequently)
  • Maintenance burden (scrapers break silently)
  • Data staleness (monthly updates don't reflect rapid model changes)

Deferred until: Internal Performance Tracking validates the value of quality metrics.

If implemented:

  • Start with manual JSON snapshots (not automated scrapers)
  • Implement LMArena Elo as optional quality boost (not required)
  • LiveBench for contamination-free validation only
  • Create benchmark staleness alerts (>30 days = warning)

Internal Performance Tracker

Council Recommendation: Build quality metrics from actual council session outcomes rather than external benchmarks.

Performance Metrics Schema

from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional

@dataclass
class ModelSessionMetric:
    """Performance data from a single council session."""
    session_id: str
    model_id: str
    timestamp: datetime

    # Stage 1 metrics
    response_latency_ms: int
    response_length: int
    parse_success: bool

    # Stage 2 metrics (from peer review)
    borda_score: float                           # 0.0 - N (N = council size)
    normalized_rank: float                       # 0.0 - 1.0 (1.0 = best)
    rubric_scores: Optional[Dict[str, float]]    # If rubric scoring enabled

    # Stage 3 metrics (from chairman selection)
    selected_for_synthesis: bool                 # Was this response referenced?

@dataclass
class ModelPerformanceIndex:
    """Aggregated performance for a model."""
    model_id: str
    sample_size: int
    last_updated: datetime

    # Aggregated metrics
    mean_borda_score: float
    mean_normalized_rank: float
    p50_latency_ms: int
    p95_latency_ms: int
    parse_success_rate: float
    selection_rate: float    # How often selected for synthesis

    # Confidence
    confidence: str          # INSUFFICIENT (<10), PRELIMINARY (10-30), MODERATE (30-100), HIGH (>100)

class InternalPerformanceTracker:
    """Track and aggregate model performance from council sessions."""

    def __init__(self, store_path: Path, decay_days: int = 30):
        self.store_path = store_path
        self.decay_days = decay_days

    def record_session(self, session_metrics: List[ModelSessionMetric]) -> None:
        """Record metrics from a completed council session."""
        # Atomic append to JSONL store
        ...

    def get_model_index(self, model_id: str) -> ModelPerformanceIndex:
        """Get aggregated performance for a model with rolling window."""
        # Apply exponential decay to older sessions
        # Recent sessions weighted higher
        ...

    def get_quality_score(self, model_id: str) -> float:
        """Get normalized quality score (0-100) for model selection."""
        index = self.get_model_index(model_id)
        if index.confidence == "INSUFFICIENT":
            return 50.0  # Default neutral score
        return index.mean_normalized_rank * 100

Integration with Selection Algorithm

def select_tier_models(tier: str, ...) -> List[str]:
    # ... hard constraints ...

    for model in eligible:
        # Use INTERNAL performance tracker instead of external benchmarks
        quality_score = performance_tracker.get_quality_score(model.model_id)
        # ... rest of scoring with tier-specific weights ...

Open Questions (Council Addressed)

Resolved by Council Review

| Question | Council Answer |
|----------|----------------|
| Should benchmark scores override tier selection? | No. Tiers represent user intent (speed vs quality tradeoff). Benchmarks inform selection within tier. |
| How to handle new models with no data? | Default neutral score (50). Use provider metadata only until internal performance data accumulates. |
| Balance between performance and cost? | Tier-specific. Quick tier: yes, select cheaper. Reasoning tier: never compromise on quality. |
| Auto-apply reasoning parameters? | Yes, by tier. Reasoning tier = high effort, quick tier = minimal effort. |
| Handle benchmark gaming? | Use internal metrics. Council session outcomes are harder to game than public benchmarks. |

Remaining Open Questions

  1. What sample size validates Internal Performance Index?

    • Council suggested 100+ sessions for HIGH confidence
    • Is 30+ sessions sufficient for MODERATE confidence?
  2. Should models with LOW internal scores be automatically demoted?

    • Threshold for exclusion from tier pools?
    • Grace period for new models?
  3. How to bootstrap Internal Performance Tracker?

    • Run shadow sessions with all available models?
    • Start with static config and learn incrementally?

Issues Identified in Full Quorum Review

A. Cold Start Problem (Claude, Gemini)

"When a new model appears in OpenRouter, it has zero internal performance data."

Recommended Solutions:

  • Assign temporary "phantom score" equivalent to tier average until 10+ samples
  • Implement Epsilon-Greedy exploration (small % of requests try new models)
  • Minimum sessions required before model enters regular rotation
  • Manual allowlist for high-profile new releases
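A sketch of the Epsilon-Greedy audition idea from the list above; the epsilon and sample thresholds are illustrative:

# Illustrative epsilon-greedy audition for models with little internal data.
import random

EPSILON = 0.05       # fraction of requests reserved for exploration (assumed value)
MIN_SAMPLES = 10     # below this a model is still "auditioning" (per phantom-score idea)

def pick_model(ranked_models: list, sample_counts: dict) -> str:
    """ranked_models is best-first; sample_counts maps model id -> tracked sessions."""
    auditioning = [m for m in ranked_models if sample_counts.get(m, 0) < MIN_SAMPLES]
    if auditioning and random.random() < EPSILON:
        return random.choice(auditioning)   # exploration: try a new model
    return ranked_models[0]                 # exploitation: best known model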

B. Borda Score Normalization (Claude)

"A 5-model session gives max score of 4; an 8-model session gives max of 7."

Solution: Normalize to percentile rank (0.0-1.0) rather than raw Borda counts:

normalized_rank = (council_size - borda_position) / council_size
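A worked example of the formula, assuming borda_position counts from 0 for the top-ranked response (the indexing convention is not pinned down above):

# Worked example of the normalization above (0 = top-ranked response, assumed).
def normalized_rank(council_size: int, borda_position: int) -> float:
    return (council_size - borda_position) / council_size

# 5-model session: best -> (5 - 0) / 5 = 1.0, worst -> (5 - 4) / 5 = 0.2
# 8-model session: best -> (8 - 0) / 8 = 1.0, worst -> (8 - 7) / 8 = 0.125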

C. Parse Success Definition (Claude)

Define parse success as ALL of:

  • Valid JSON returned (if JSON expected)
  • Schema-compliant response
  • Extractable vote/rationale for Stage 2

D. Anti-Herding Edge Case (Gemini)

"If only 2 models pass hard constraints, the system might oscillate wildly."

Solution: Disable Anti-Herding when eligible model count < 3.

E. Degradation Behavior (Claude)

"What happens when ALL eligible models for a tier fall below acceptable thresholds?"

Fallback Chain:

  1. Warn user and proceed with best-available
  2. Escalate to adjacent tier (quick→balanced, balanced→high)
  3. Fall back to static config as last resort
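A minimal sketch of that fallback chain; the escalation map and helper names are illustrative:

# Illustrative degradation chain: warn, escalate to the adjacent tier, then static config.
import logging

logger = logging.getLogger(__name__)

TIER_ESCALATION = {"quick": "balanced", "balanced": "high", "high": "reasoning"}

def select_with_degradation(tier: str, select_fn, static_fallback_fn) -> list:
    models = select_fn(tier)
    if models:
        return models
    logger.warning("No acceptable models for tier=%s; escalating", tier)
    next_tier = TIER_ESCALATION.get(tier)
    if next_tier:
        models = select_fn(next_tier)
        if models:
            return models
    return static_fallback_fn(tier)   # last resort: static config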

References

External Sources

Research