ADR-029: Model Audition Mechanism
Status: ACCEPTED (Revised per Council Review 2025-12-24)
Date: 2025-12-24
Decision Makers: Engineering, Architecture
Depends On: ADR-028 (Dynamic Candidate Discovery), ADR-027 (Frontier Tier)
Council Review: Reasoning tier (gpt-5.2-pro, claude-opus-4.5, gemini-3-pro-preview, grok-4.1-fast)
Context
When new models are discovered via ADR-028, they lack performance history. The system cannot accurately score them because:
- No latency measurements exist
- No quality observations from past sessions
- No availability/reliability data
Cold Start Problem: New models are either never selected (no history → low score) or selected blindly (no data to validate quality).
Decision
Implement a volume-based audition mechanism with Shadow Mode integration and explicit state machine progression.
Model Lifecycle State Machine (Council Requirement)
Critical Feedback: Time-based graduation is unreliable. A model used once in 30 days isn't "proven."
┌─────────────────────────────────────────────────────────────────────────────┐
│                        Model Lifecycle State Machine                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────┐    10 sessions    ┌───────────┐    25 sessions    ┌──────┐    │
│  │  SHADOW  │ ─────────────────▶│ PROBATION │ ─────────────────▶│ EVAL │    │
│  │          │   (min 3 days)    │           │   (min 7 days)    │      │    │
│  └──────────┘                   └───────────┘                   └──────┘    │
│       │                               │                            │        │
│       │ 3+ failures                   │ 5+ failures                │        │
│       ▼                               ▼                            ▼        │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                              QUARANTINE                              │   │
│  │            (Cooldown: 24h shadow, then retry from SHADOW)            │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌──────┐    50 sessions    ┌──────┐                                        │
│  │ EVAL │ ─────────────────▶│ FULL │  (Normal selection, full authority)    │
│  │      │  (quality ≥75th   │      │                                        │
│  └──────┘    percentile)    └──────┘                                        │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
State Definitions
from enum import Enum, auto
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


class AuditionState(Enum):
    """Model audition lifecycle states."""
    SHADOW = auto()      # Non-binding votes, observation only
    PROBATION = auto()   # Limited selection, paired with proven models
    EVALUATION = auto()  # Weighted selection, building confidence
    FULL = auto()        # Normal selection, full voting authority
    QUARANTINE = auto()  # Temporarily excluded due to failures


@dataclass
class AuditionStatus:
    """Tracks model's audition progress."""
    model_id: str
    state: AuditionState
    session_count: int = 0
    first_seen: Optional[datetime] = None
    last_seen: Optional[datetime] = None
    consecutive_failures: int = 0
    quality_percentile: Optional[float] = None
    quarantine_until: Optional[datetime] = None

    def days_tracked(self) -> int:
        """Days since first session."""
        if self.first_seen is None:
            return 0
        return (datetime.utcnow() - self.first_seen).days


@dataclass(frozen=True)
class GraduationCriteria:
    """Criteria for state transitions (volume-based)."""
    # SHADOW → PROBATION
    shadow_min_sessions: int = 10
    shadow_min_days: int = 3
    shadow_max_failures: int = 3
    # PROBATION → EVALUATION
    probation_min_sessions: int = 25
    probation_min_days: int = 7
    probation_max_failures: int = 5
    # EVALUATION → FULL
    eval_min_sessions: int = 50
    eval_min_quality_percentile: float = 0.75  # Must be in top 25%
    # QUARANTINE
    quarantine_cooldown_hours: int = 24
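The criteria above gate FULL status on quality_percentile, but this ADR does not say how that value is computed. One plausible reading, offered only as a sketch, is a percentile rank against the quality scores of models already in FULL. The function name percentile_rank and the sample scores below are hypothetical.

```python
from typing import Sequence


def percentile_rank(candidate_score: float, peer_scores: Sequence[float]) -> float:
    """Fraction of peer scores that the candidate matches or beats (0.0-1.0).

    Assumption: peers are the quality scores of models already in FULL.
    Returns 0.0 when there are no peers, so the gate stays closed.
    """
    if not peer_scores:
        return 0.0
    at_or_below = sum(1 for s in peer_scores if s <= candidate_score)
    return at_or_below / len(peer_scores)


# Hypothetical peer quality scores, for illustration only.
peers = [0.61, 0.70, 0.74, 0.82]
rank = percentile_rank(0.80, peers)
passes_gate = rank >= 0.75  # compare against eval_min_quality_percentile
```

Returning 0.0 for an empty peer set is a conservative choice made for this sketch. A real deployment would need a bootstrap rule for the case where no model has reached FULL yet.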
State Transition Logic
def evaluate_state_transition(
    status: AuditionStatus,
    criteria: GraduationCriteria
) -> Optional[AuditionState]:
    """
    Determine if model should transition states.

    Volume-based graduation per Council recommendation.
    Returns the target state, or None if no transition applies.
    """
    now = datetime.utcnow()

    # Check quarantine expiry
    if status.state == AuditionState.QUARANTINE:
        if status.quarantine_until and now >= status.quarantine_until:
            return AuditionState.SHADOW  # Retry from shadow
        return None  # Stay quarantined

    # Check failure thresholds (SHADOW and PROBATION → QUARANTINE;
    # GraduationCriteria defines no failure threshold for EVALUATION)
    if status.state == AuditionState.SHADOW:
        if status.consecutive_failures >= criteria.shadow_max_failures:
            return AuditionState.QUARANTINE
    if status.state == AuditionState.PROBATION:
        if status.consecutive_failures >= criteria.probation_max_failures:
            return AuditionState.QUARANTINE

    # SHADOW → PROBATION
    if status.state == AuditionState.SHADOW:
        if (status.session_count >= criteria.shadow_min_sessions
                and status.days_tracked() >= criteria.shadow_min_days):
            return AuditionState.PROBATION

    # PROBATION → EVALUATION
    if status.state == AuditionState.PROBATION:
        if (status.session_count >= criteria.probation_min_sessions
                and status.days_tracked() >= criteria.probation_min_days):
            return AuditionState.EVALUATION

    # EVALUATION → FULL
    if status.state == AuditionState.EVALUATION:
        if (status.session_count >= criteria.eval_min_sessions
                and status.quality_percentile is not None
                and status.quality_percentile >= criteria.eval_min_quality_percentile):
            return AuditionState.FULL

    return None  # No transition
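The transition function only reports a target state. The ADR does not show what updates the stored record, in particular who sets quarantine_until. The sketch below assumes a hypothetical helper, apply_transition, that starts the cooldown on entering QUARANTINE and clears the failure streak on leaving it. The types are trimmed copies of the definitions above, and whether session_count should also reset on retry is left open here.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta
from enum import Enum, auto
from typing import Optional


class AuditionState(Enum):  # trimmed copy of the enum above
    SHADOW = auto()
    PROBATION = auto()
    EVALUATION = auto()
    FULL = auto()
    QUARANTINE = auto()


@dataclass
class AuditionStatus:  # only the fields this sketch touches
    model_id: str
    state: AuditionState
    consecutive_failures: int = 0
    quarantine_until: Optional[datetime] = None


def apply_transition(
    status: AuditionStatus,
    new_state: AuditionState,
    now: datetime,
    cooldown_hours: int = 24,
) -> AuditionStatus:
    """Return an updated copy of status after moving to new_state (hypothetical helper)."""
    if new_state == AuditionState.QUARANTINE:
        # Start the cooldown clock when the model is excluded.
        return replace(
            status,
            state=new_state,
            quarantine_until=now + timedelta(hours=cooldown_hours),
        )
    if status.state == AuditionState.QUARANTINE:
        # Leaving quarantine: clear the deadline and the failure streak,
        # otherwise the model would be re-quarantined on the next check.
        return replace(
            status, state=new_state, consecutive_failures=0, quarantine_until=None
        )
    return replace(status, state=new_state)


now = datetime(2025, 12, 24, 12, 0)
failing = AuditionStatus("example/model", AuditionState.SHADOW, consecutive_failures=3)
quarantined = apply_transition(failing, AuditionState.QUARANTINE, now)
retried = apply_transition(quarantined, AuditionState.SHADOW, now + timedelta(hours=24))
```

Returning a copy rather than mutating in place keeps the old record available for the state-transition metric and log event.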
Shadow Mode Integration (ADR-027)
Critical Council Feedback: Audition models must NOT influence consensus until proven.
from llm_council.voting import VotingAuthority

# State → Voting Authority mapping
STATE_VOTING_AUTHORITY = {
    AuditionState.SHADOW: VotingAuthority.ADVISORY,      # Non-binding
    AuditionState.PROBATION: VotingAuthority.ADVISORY,   # Non-binding
    AuditionState.EVALUATION: VotingAuthority.ADVISORY,  # Non-binding until FULL
    AuditionState.FULL: VotingAuthority.FULL,            # Full voting rights
    AuditionState.QUARANTINE: VotingAuthority.EXCLUDED,  # Not selected
}


# AuditionTracker is quoted as a forward reference; it is introduced in the
# Implementation Plan and only its get_status() method is used here.
def get_voting_authority(model_id: str, tracker: "AuditionTracker") -> VotingAuthority:
    """Get voting authority based on audition state."""
    status = tracker.get_status(model_id)
    if status is None:
        # Unknown model - treat as new (SHADOW)
        return VotingAuthority.ADVISORY
    return STATE_VOTING_AUTHORITY.get(status.state, VotingAuthority.ADVISORY)
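The mapping decides who may vote, but the ADR does not show how the consensus step consumes it. As a rough sketch, a tally could count only FULL-authority votes and keep advisory votes aside for observation. The VotingAuthority enum here is a stand-in for the one in llm_council.voting, and tally_binding_votes and the model names are hypothetical.

```python
from collections import Counter
from enum import Enum
from typing import Dict, List, Tuple


class VotingAuthority(Enum):  # stand-in for llm_council.voting.VotingAuthority
    FULL = "full"
    ADVISORY = "advisory"
    EXCLUDED = "excluded"


def tally_binding_votes(
    votes: List[Tuple[str, str]],
    authority: Dict[str, VotingAuthority],
) -> Tuple[Counter, List[Tuple[str, str]]]:
    """Count only FULL-authority votes; return advisory votes separately.

    votes is a list of (model_id, choice) pairs. Models missing from
    authority are treated as ADVISORY, matching get_voting_authority.
    """
    binding: Counter = Counter()
    advisory: List[Tuple[str, str]] = []
    for model_id, choice in votes:
        level = authority.get(model_id, VotingAuthority.ADVISORY)
        if level is VotingAuthority.FULL:
            binding[choice] += 1
        elif level is VotingAuthority.ADVISORY:
            advisory.append((model_id, choice))  # kept for observation only
        # EXCLUDED votes are dropped entirely
    return binding, advisory


authority = {
    "proven-a": VotingAuthority.FULL,
    "proven-b": VotingAuthority.FULL,
    "newcomer": VotingAuthority.ADVISORY,
}
votes = [("proven-a", "answer-1"), ("proven-b", "answer-1"), ("newcomer", "answer-2")]
binding, advisory = tally_binding_votes(votes, authority)
```

Keeping the advisory votes rather than discarding them is what would let the quality metrics compare an audition model's choices against the binding outcome.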
Selection with Audition (Revised)
Council Feedback: The code/table inconsistency is fixed. Progressive weighting replaces the binary epsilon-greedy choice.
from typing import List, Optional, Tuple


def select_with_audition(
    scored_candidates: List[Tuple[str, float]],
    tracker: "AuditionTracker",
    count: int = 4,
) -> List[str]:
    """
    Select models with state-appropriate weighting.

    Progressive weight approach (NOT epsilon-greedy):
    - SHADOW/PROBATION: 30% selection weight
    - EVALUATION: 30-100% (scaled by session count)
    - FULL: 100% selection weight
    - QUARANTINE: 0% (never selected)
    """
    weighted_candidates = []
    for model_id, score in scored_candidates:
        status = tracker.get_status(model_id)
        weight = get_selection_weight(status)
        if weight <= 0.0:
            # Quarantined: drop outright. Ranking it last is not enough,
            # because it could still fill a free audition seat.
            continue
        weighted_score = score * weight
        weighted_candidates.append((model_id, weighted_score, weight))

    # Sort by weighted score, highest first
    weighted_candidates.sort(key=lambda x: -x[1])

    # Enforce max audition seats (protect council quality)
    selected = []
    audition_count = 0
    max_audition_seats = 1  # Only 1 audition model per session
    for model_id, weighted_score, weight in weighted_candidates:
        if len(selected) >= count:
            break
        # Is this an audition model (weight < 1.0)?
        is_audition = weight < 1.0
        if is_audition:
            if audition_count >= max_audition_seats:
                continue  # Skip, already have max audition models
            audition_count += 1
        selected.append(model_id)
    return selected


def get_selection_weight(status: Optional[AuditionStatus]) -> float:
    """
    Get selection weight based on audition state.

    Consistent with table definitions (fixes code/table mismatch).
    """
    if status is None:
        return 0.3  # New model: SHADOW weight
    if status.state == AuditionState.SHADOW:
        return 0.3
    if status.state == AuditionState.PROBATION:
        return 0.3
    if status.state == AuditionState.EVALUATION:
        # Scale from 0.3 to 1.0 based on session progress
        # At 25 sessions (EVAL entry): 0.3
        # At 50 sessions (FULL graduation): 1.0
        # Clamped to [0, 1] so a count below 25 cannot push the weight under 0.3
        progress = min(1.0, max(0.0, (status.session_count - 25) / 25))
        return 0.3 + (0.7 * progress)
    if status.state == AuditionState.FULL:
        return 1.0
    if status.state == AuditionState.QUARANTINE:
        return 0.0  # Never select
    return 0.3  # Default: cautious

# REMOVED: Epsilon-greedy approach (inconsistent with progressive weights)
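To make the EVALUATION ramp concrete, the following sketch evaluates the same linear formula at a few session counts. The function name evaluation_weight and its entry/graduation parameters are introduced here for illustration; the defaults mirror the 25 and 50 session thresholds used above.

```python
def evaluation_weight(session_count: int, entry: int = 25, graduation: int = 50) -> float:
    """Linear ramp from 0.3 at `entry` sessions to 1.0 at `graduation` sessions."""
    progress = (session_count - entry) / (graduation - entry)
    progress = max(0.0, min(1.0, progress))  # clamp outside the EVALUATION range
    return 0.3 + 0.7 * progress


# Rounded to two decimals to hide floating-point noise.
ramp = {n: round(evaluation_weight(n), 2) for n in (25, 30, 40, 50, 60)}
# ramp == {25: 0.3, 30: 0.44, 40: 0.72, 50: 1.0, 60: 1.0}
```

Because the weight multiplies the raw score, an EVALUATION model at 30 sessions with a raw score of 0.9 ranks at 0.9 × 0.44 ≈ 0.40, below a FULL model scoring 0.5.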
Probationary Periods (Revised - Volume-Based)
| State | Sessions | Min Days | Selection Weight | Max Seats | Voting |
|---|---|---|---|---|---|
| Shadow | 0-10 | 3 | 30% | 1 | Advisory |
| Probation | 10-25 | 7 | 30% | 1 | Advisory |
| Evaluation | 25-50 | - | 30-100% | 1 | Advisory |
| Full | 50+ | - | 100% | Any | Full |
| Quarantine | - | - | 0% | 0 | Excluded |
Audition Safeguards
- Shadow Mode by default - New models don't affect consensus
- Volume-based graduation - Requires actual usage, not just time
- Max 1 audition per council session - Limit risk exposure
- Paired with proven models - Always have reliable responses
- Quality gate for FULL status - Must be top 25% to graduate
- Quarantine on failures - Automatic exclusion with cooldown
- Consensus exclusion - Audition models excluded from tie-breaking
Configuration
council:
  model_intelligence:
    audition:
      enabled: true
      # State progression (volume-based)
      shadow:
        min_sessions: 10
        min_days: 3
        max_failures: 3
      probation:
        min_sessions: 25
        min_days: 7
        max_failures: 5
      evaluation:
        min_sessions: 50
        min_quality_percentile: 0.75
      quarantine:
        cooldown_hours: 24
      # Selection limits
      max_audition_seats: 1
Environment Variables
| Variable | Type | Default | Purpose |
|---|---|---|---|
| LLM_COUNCIL_AUDITION_ENABLED | bool | true | Enable audition mechanism |
| LLM_COUNCIL_AUDITION_MAX_SEATS | int | 1 | Max audition models per session |
| LLM_COUNCIL_AUDITION_SHADOW_SESSIONS | int | 10 | Sessions before probation |
| LLM_COUNCIL_AUDITION_EVAL_SESSIONS | int | 50 | Sessions for full graduation |
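A minimal sketch of reading these four variables with their documented defaults, using only the standard library. The AuditionEnvConfig container, the load_audition_env function, and the accepted spellings for a true boolean are assumptions. The ADR does not state how booleans are parsed or whether environment variables override the YAML values.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AuditionEnvConfig:  # hypothetical container, not part of the ADR
    enabled: bool = True
    max_seats: int = 1
    shadow_sessions: int = 10
    eval_sessions: int = 50


def _parse_bool(raw: str) -> bool:
    # Assumption: these spellings count as true; the ADR does not define them.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_audition_env(env: Mapping[str, str] = os.environ) -> AuditionEnvConfig:
    """Read the four documented variables, falling back to the table defaults."""
    defaults = AuditionEnvConfig()
    prefix = "LLM_COUNCIL_AUDITION_"
    enabled_raw = env.get(prefix + "ENABLED")
    return AuditionEnvConfig(
        enabled=defaults.enabled if enabled_raw is None else _parse_bool(enabled_raw),
        max_seats=int(env.get(prefix + "MAX_SEATS", defaults.max_seats)),
        shadow_sessions=int(env.get(prefix + "SHADOW_SESSIONS", defaults.shadow_sessions)),
        eval_sessions=int(env.get(prefix + "EVAL_SESSIONS", defaults.eval_sessions)),
    )


# Passing a plain dict instead of os.environ keeps the example deterministic.
config = load_audition_env({"LLM_COUNCIL_AUDITION_MAX_SEATS": "2"})
```

Accepting any mapping makes the loader easy to test without touching the process environment.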
Observability (Council Requirement)
# Metrics to emit
audition.state_transition{model_id, from_state, to_state}
audition.selection{model_id, state, selected}
audition.failure{model_id, state, failure_type}
audition.quarantine{model_id, reason, cooldown_hours}
audition.graduation{model_id, quality_percentile}
# Structured logging
{
  "event": "audition_state_change",
  "model_id": "openai/gpt-5.2",
  "from_state": "PROBATION",
  "to_state": "EVALUATION",
  "session_count": 25,
  "days_tracked": 8,
  "quality_percentile": null
}
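One way to produce a log line of this shape is with the standard json and logging modules, as sketched below. The logger name and the format_state_change function are assumptions; the field names follow the example event above.

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("llm_council.audition")  # logger name is an assumption


def format_state_change(
    model_id: str,
    from_state: str,
    to_state: str,
    session_count: int,
    days_tracked: int,
    quality_percentile: Optional[float] = None,
) -> str:
    """Serialize one audition_state_change event as a single JSON line."""
    event = {
        "event": "audition_state_change",
        "model_id": model_id,
        "from_state": from_state,
        "to_state": to_state,
        "session_count": session_count,
        "days_tracked": days_tracked,
        "quality_percentile": quality_percentile,  # None serializes as null
    }
    return json.dumps(event)


line = format_state_change("openai/gpt-5.2", "PROBATION", "EVALUATION", 25, 8)
logger.info(line)
```

Emitting one JSON object per line keeps the events easy to parse with ordinary log tooling.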
Consequences
Positive
- Solves cold start problem for new models
- Volume-based graduation ensures actual usage
- Shadow Mode protects consensus from unproven models
- Explicit state machine provides clear lifecycle
- Quality gate prevents low-quality model graduation
Negative
- Audition models contribute responses but not votes
- Longer path to full status (50 sessions minimum)
- Additional complexity in selection logic
- Quarantine cannot distinguish transient failures from systemic ones
Risks & Mitigations
| Risk | Mitigation |
|---|---|
| Audition model pollutes context | Shadow Mode (advisory only) |
| Volume manipulation (spam requests) | Min days requirement alongside sessions |
| Good model stuck in evaluation | Clear quality percentile metric |
| Quarantine too aggressive | Configurable thresholds, auto-retry |
Testing Strategy
class TestAuditionMechanism:
    def test_state_transitions_volume_based(self):
        """State transitions require session count, not just time."""

    def test_shadow_mode_no_voting(self):
        """Shadow state models have ADVISORY voting authority."""

    def test_quarantine_on_failures(self):
        """Consecutive failures trigger quarantine."""

    def test_quarantine_cooldown_and_retry(self):
        """Models retry from SHADOW after cooldown."""

    def test_max_audition_seats(self):
        """Only 1 audition model selected per session."""

    def test_quality_gate_for_full(self):
        """EVALUATION → FULL requires 75th percentile quality."""

    def test_selection_weight_progression(self):
        """Weight scales 0.3 → 1.0 during EVALUATION."""
Implementation Plan
- Define AuditionState enum and AuditionStatus dataclass
- Implement state transition logic with volume-based criteria
- Add AuditionTracker with persistence (JSONL)
- Integrate with VotingAuthority from ADR-027
- Implement select_with_audition() with progressive weights
- Add metrics and structured logging
- Add configuration via YAML and env vars
- Add comprehensive tests
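The plan calls for JSONL persistence but gives no format. As a rough sketch under that gap, an append-only file with one JSON object per line, where the last record per model wins on reload, would be enough to survive a restart. The class name JsonlAuditionTracker, the record method, and the three stored fields are hypothetical; only get_status appears elsewhere in this ADR, and a fuller version would serialize the complete AuditionStatus, including its datetimes.

```python
import json
import os
import tempfile
from typing import Dict, Optional


class JsonlAuditionTracker:  # hypothetical; the ADR names only AuditionTracker.get_status
    """Append-only JSONL store: one JSON object per line, last write per model wins."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._latest: Dict[str, dict] = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                for line in fh:
                    if line.strip():
                        record = json.loads(line)
                        self._latest[record["model_id"]] = record

    def get_status(self, model_id: str) -> Optional[dict]:
        return self._latest.get(model_id)

    def record(self, model_id: str, state: str, session_count: int) -> None:
        """Persist a snapshot of the model's audition status."""
        entry = {"model_id": model_id, "state": state, "session_count": session_count}
        with open(self._path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")
        self._latest[model_id] = entry


# Round trip through a temporary file to show that state survives a restart.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "audition.jsonl")
    tracker = JsonlAuditionTracker(path)
    tracker.record("example/model", "SHADOW", 9)
    tracker.record("example/model", "PROBATION", 10)
    reloaded = JsonlAuditionTracker(path).get_status("example/model")
```

Appending rather than rewriting keeps a history of every state change, at the cost of a file that grows until it is compacted.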