ADR-029: Model Audition Mechanism

Status: ACCEPTED (Revised per Council Review 2025-12-24)
Date: 2025-12-24
Decision Makers: Engineering, Architecture
Depends On: ADR-028 (Dynamic Candidate Discovery), ADR-027 (Frontier Tier)
Council Review: Reasoning tier (gpt-5.2-pro, claude-opus-4.5, gemini-3-pro-preview, grok-4.1-fast)


Context

When new models are discovered via ADR-028, they lack performance history. The system cannot accurately score them because:

  1. No latency measurements exist
  2. No quality observations from past sessions
  3. No availability/reliability data

Cold Start Problem: New models are either never selected (no history → low score) or selected blindly (no data to validate quality).


Decision

Implement a volume-based audition mechanism with Shadow Mode integration and explicit state machine progression.

Model Lifecycle State Machine (Council Requirement)

Critical Feedback: Time-based graduation is unreliable. A model used once in 30 days isn't "proven."

┌─────────────────────────────────────────────────────────────────────────────┐
│                        Model Lifecycle State Machine                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────┐    10 sessions    ┌───────────┐    25 sessions    ┌──────┐    │
│  │  SHADOW  │ ────────────────▶ │ PROBATION │ ────────────────▶ │ EVAL │    │
│  │          │   (min 3 days)    │           │   (min 7 days)    │      │    │
│  └──────────┘                   └───────────┘                   └──────┘    │
│       │                               │                            │        │
│       │ 3+ failures                   │ 5+ failures                │        │
│       ▼                               ▼                            ▼        │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                              QUARANTINE                               │  │
│  │            (Cooldown: 24h shadow, then retry from SHADOW)             │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  ┌──────┐    50 sessions    ┌──────┐                                        │
│  │ EVAL │ ────────────────▶ │ FULL │  (Normal selection, full authority)    │
│  │      │  (quality ≥75th   │      │                                        │
│  └──────┘    percentile)    └──────┘                                        │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

State Definitions

from enum import Enum, auto
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

class AuditionState(Enum):
    """Model audition lifecycle states."""
    SHADOW = auto()      # Non-binding votes, observation only
    PROBATION = auto()   # Limited selection, paired with proven models
    EVALUATION = auto()  # Weighted selection, building confidence
    FULL = auto()        # Normal selection, full voting authority
    QUARANTINE = auto()  # Temporarily excluded due to failures


@dataclass
class AuditionStatus:
    """Tracks model's audition progress."""
    model_id: str
    state: AuditionState
    session_count: int = 0
    first_seen: Optional[datetime] = None
    last_seen: Optional[datetime] = None
    consecutive_failures: int = 0
    quality_percentile: Optional[float] = None
    quarantine_until: Optional[datetime] = None

    def days_tracked(self) -> int:
        """Days since first session."""
        if self.first_seen is None:
            return 0
        return (datetime.utcnow() - self.first_seen).days


@dataclass(frozen=True)
class GraduationCriteria:
    """Criteria for state transitions (volume-based)."""
    # SHADOW → PROBATION
    shadow_min_sessions: int = 10
    shadow_min_days: int = 3
    shadow_max_failures: int = 3

    # PROBATION → EVALUATION
    probation_min_sessions: int = 25
    probation_min_days: int = 7
    probation_max_failures: int = 5

    # EVALUATION → FULL
    eval_min_sessions: int = 50
    eval_min_quality_percentile: float = 0.75  # Must be in top 25%

    # QUARANTINE
    quarantine_cooldown_hours: int = 24
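`days_tracked()` relies on `datetime.utcnow()`, which returns a naive datetime and is deprecated as of Python 3.12. A minimal sketch of a timezone-aware equivalent follows, assuming `first_seen` is stored as an aware UTC timestamp. The helper name `days_since` and its injectable `now` parameter are illustrative and not part of this ADR.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def days_since(first_seen: Optional[datetime], now: Optional[datetime] = None) -> int:
    """Whole days elapsed since first_seen (0 if the model was never seen).

    Hypothetical helper: `now` is injectable so tests can pin the clock.
    """
    if first_seen is None:
        return 0
    current = now if now is not None else datetime.now(timezone.utc)
    return (current - first_seen).days


# Usage: 3 days and 5 hours after first sighting counts as 3 whole days.
first = datetime(2025, 12, 1, tzinfo=timezone.utc)
print(days_since(first, now=first + timedelta(days=3, hours=5)))  # 3
```

Pinning the clock this way also makes the `min_days` gates straightforward to unit test.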

State Transition Logic

def evaluate_state_transition(
    status: AuditionStatus,
    criteria: GraduationCriteria
) -> Optional[AuditionState]:
    """
    Determine if model should transition states.
    Volume-based graduation per Council recommendation.
    """
    now = datetime.utcnow()

    # Check quarantine expiry
    if status.state == AuditionState.QUARANTINE:
        if status.quarantine_until and now >= status.quarantine_until:
            return AuditionState.SHADOW  # Retry from shadow
        return None  # Stay quarantined

    # Check failure thresholds (SHADOW/PROBATION → QUARANTINE).
    # GraduationCriteria defines no failure threshold for EVALUATION.
    if status.state == AuditionState.SHADOW:
        if status.consecutive_failures >= criteria.shadow_max_failures:
            return AuditionState.QUARANTINE

    if status.state == AuditionState.PROBATION:
        if status.consecutive_failures >= criteria.probation_max_failures:
            return AuditionState.QUARANTINE

    # SHADOW → PROBATION
    if status.state == AuditionState.SHADOW:
        if (status.session_count >= criteria.shadow_min_sessions
                and status.days_tracked() >= criteria.shadow_min_days):
            return AuditionState.PROBATION

    # PROBATION → EVALUATION
    if status.state == AuditionState.PROBATION:
        if (status.session_count >= criteria.probation_min_sessions
                and status.days_tracked() >= criteria.probation_min_days):
            return AuditionState.EVALUATION

    # EVALUATION → FULL
    if status.state == AuditionState.EVALUATION:
        if (status.session_count >= criteria.eval_min_sessions
                and status.quality_percentile is not None
                and status.quality_percentile >= criteria.eval_min_quality_percentile):
            return AuditionState.FULL

    return None  # No transition
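The transition function reads `session_count` and `consecutive_failures`, but this ADR does not show how those counters are maintained. The sketch below shows one possible update rule. It assumes that every session counts toward volume and that a single success resets the failure streak. `SessionCounters`, `record_session`, and `should_quarantine` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SessionCounters:
    """Hypothetical stand-in for the counter fields of AuditionStatus."""
    session_count: int = 0
    consecutive_failures: int = 0


def record_session(counters: SessionCounters, succeeded: bool) -> SessionCounters:
    """Update counters after one session (assumed semantics, not from the ADR)."""
    counters.session_count += 1
    if succeeded:
        counters.consecutive_failures = 0  # a success breaks the streak
    else:
        counters.consecutive_failures += 1
    return counters


def should_quarantine(counters: SessionCounters, max_failures: int) -> bool:
    """Same comparison as the threshold check: streak >= configured maximum."""
    return counters.consecutive_failures >= max_failures
```

Under these assumptions, a model that fails twice, succeeds once, and fails again has a streak of 1, not 3, so it stays below the SHADOW threshold.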

Shadow Mode Integration (ADR-027)

Critical Council Feedback: Audition models must NOT influence consensus until proven.

from llm_council.voting import VotingAuthority

# State → Voting Authority mapping
STATE_VOTING_AUTHORITY = {
    AuditionState.SHADOW: VotingAuthority.ADVISORY,      # Non-binding
    AuditionState.PROBATION: VotingAuthority.ADVISORY,   # Non-binding
    AuditionState.EVALUATION: VotingAuthority.ADVISORY,  # Non-binding until FULL
    AuditionState.FULL: VotingAuthority.FULL,            # Full voting rights
    AuditionState.QUARANTINE: VotingAuthority.EXCLUDED,  # Not selected
}


# AuditionTracker is quoted because it is defined later (see Implementation Plan).
def get_voting_authority(model_id: str, tracker: "AuditionTracker") -> VotingAuthority:
    """Get voting authority based on audition state."""
    status = tracker.get_status(model_id)

    if status is None:
        # Unknown model - treat as new (SHADOW)
        return VotingAuthority.ADVISORY

    return STATE_VOTING_AUTHORITY.get(status.state, VotingAuthority.ADVISORY)
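To illustrate what "non-binding" means in practice, the sketch below tallies only votes cast with `FULL` authority. It is an assumption about how a consensus step might consume the mapping, not the ADR-027 implementation. The local `VotingAuthority` enum is a stand-in for `llm_council.voting.VotingAuthority` with member names taken from the mapping above, and `tally_binding_votes` is a hypothetical helper.

```python
from enum import Enum
from typing import Dict, List, Tuple


class VotingAuthority(Enum):
    """Local stand-in for llm_council.voting.VotingAuthority (illustrative values)."""
    FULL = "full"
    ADVISORY = "advisory"
    EXCLUDED = "excluded"


def tally_binding_votes(
    votes: List[Tuple[str, str, VotingAuthority]],
) -> Dict[str, int]:
    """Count votes per choice, ignoring ADVISORY and EXCLUDED voters.

    Each vote is a (model_id, choice, authority) tuple.
    """
    totals: Dict[str, int] = {}
    for _model_id, choice, authority in votes:
        if authority is VotingAuthority.FULL:
            totals[choice] = totals.get(choice, 0) + 1
    return totals
```

Advisory votes can still be logged for later quality scoring; they simply never reach the totals that decide consensus.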

Selection with Audition (Revised)

Council Feedback: Code inconsistency fixed. Progressive weight replaces epsilon-greedy binary.

from typing import List, Optional, Tuple


def select_with_audition(
    scored_candidates: List[Tuple[str, float]],
    tracker: "AuditionTracker",
    count: int = 4,
) -> List[str]:
    """
    Select models with state-appropriate weighting.

    Progressive weight approach (NOT epsilon-greedy):
    - SHADOW/PROBATION: 30% selection weight
    - EVALUATION: 30-100% (scaled by session count)
    - FULL: 100% selection weight
    """
    weighted_candidates = []

    for model_id, score in scored_candidates:
        status = tracker.get_status(model_id)
        weight = get_selection_weight(status)
        if weight <= 0.0:
            continue  # QUARANTINE: never select, even to fill empty seats
        weighted_score = score * weight
        weighted_candidates.append((model_id, weighted_score, weight))

    # Sort by weighted score, highest first
    weighted_candidates.sort(key=lambda x: -x[1])

    # Enforce max audition seats (protect council quality)
    selected = []
    audition_count = 0
    max_audition_seats = 1  # Only 1 audition model per session

    for model_id, weighted_score, weight in weighted_candidates:
        if len(selected) >= count:
            break

        # Is this an audition model (weight < 1.0)?
        is_audition = weight < 1.0

        if is_audition:
            if audition_count >= max_audition_seats:
                continue  # Skip, already have max audition models
            audition_count += 1

        selected.append(model_id)

    return selected


def get_selection_weight(status: Optional[AuditionStatus]) -> float:
    """
    Get selection weight based on audition state.
    Consistent with table definitions (fixes code/table mismatch).
    """
    if status is None:
        return 0.3  # New model: SHADOW weight

    if status.state == AuditionState.SHADOW:
        return 0.3

    if status.state == AuditionState.PROBATION:
        return 0.3

    if status.state == AuditionState.EVALUATION:
        # Scale from 0.3 to 1.0 based on session progress
        # At 25 sessions (EVAL entry): 0.3
        # At 50 sessions (FULL graduation): 1.0
        # Clamped to [0, 1] so the weight never leaves the 0.3-1.0 range
        progress = max(0.0, min(1.0, (status.session_count - 25) / 25))
        return 0.3 + (0.7 * progress)

    if status.state == AuditionState.FULL:
        return 1.0

    if status.state == AuditionState.QUARANTINE:
        return 0.0  # Never select

    return 0.3  # Default: cautious


# REMOVED: Epsilon-greedy approach (inconsistent with progressive weights)
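As a worked example of the EVALUATION ramp, the standalone sketch below evaluates the same linear formula at a few session counts. The function name `evaluation_weight` and its parameters are illustrative; the defaults mirror the 25-session entry point, the 50-session graduation point, and the 0.3 floor used above.

```python
def evaluation_weight(
    session_count: int,
    entry: int = 25,
    graduation: int = 50,
    floor: float = 0.3,
) -> float:
    """Linear ramp from `floor` at `entry` sessions to 1.0 at `graduation`."""
    progress = (session_count - entry) / (graduation - entry)
    progress = max(0.0, min(1.0, progress))  # clamp outside the ramp
    return floor + (1.0 - floor) * progress


for n in (25, 30, 40, 50):
    print(n, round(evaluation_weight(n), 2))
# 25 0.3
# 30 0.44
# 40 0.72
# 50 1.0
```

At 30 sessions the model is one fifth of the way through EVALUATION, so it gains one fifth of the remaining 0.7: 0.3 + 0.14 = 0.44.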

Probationary Periods (Revised - Volume-Based)

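| State      | Sessions | Min Days | Selection Weight | Max Seats | Voting   |
|------------|----------|----------|------------------|-----------|----------|
| Shadow     | 0-10     | 3        | 30%              | 1         | Advisory |
| Probation  | 10-25    | 7        | 30%              | 1         | Advisory |
| Evaluation | 25-50    | -        | 30-100%          | 1         | Advisory |
| Full       | 50+      | -        | 100%             | Any       | Full     |
| Quarantine | -        | -        | 0%               | 0         | Excluded |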

Audition Safeguards

  1. Shadow Mode by default - New models don't affect consensus
  2. Volume-based graduation - Requires actual usage, not just time
  3. Max 1 audition per council session - Limit risk exposure
  4. Paired with proven models - Always have reliable responses
  5. Quality gate for FULL status - Must be top 25% to graduate
  6. Quarantine on failures - Automatic exclusion with cooldown
  7. Consensus exclusion - Audition models excluded from tie-breaking
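Safeguards 3 and 4 can be checked as invariants on any selected council. The sketch below is one way to express that check, for example as a post-selection assertion or in tests. `check_council_safeguards` and the `is_audition` map are hypothetical names, and treating unknown models as audition models is an assumption chosen to match the cautious defaults elsewhere in this ADR.

```python
from typing import Dict, List


def check_council_safeguards(
    selected: List[str],
    is_audition: Dict[str, bool],
    max_audition_seats: int = 1,
) -> List[str]:
    """Return violations of safeguards 3 and 4; an empty list means the council passes."""
    violations: List[str] = []
    # Models missing from the map are treated as audition models (cautious default).
    audition = [m for m in selected if is_audition.get(m, True)]
    if len(audition) > max_audition_seats:
        violations.append(
            f"{len(audition)} audition models selected, limit is {max_audition_seats}"
        )
    if selected and len(audition) == len(selected):
        violations.append("no proven model in the council")
    return violations
```

A council of two proven models plus one audition model returns an empty list, while a council made up only of audition models is flagged even when it respects the seat limit.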

Configuration

council:
  model_intelligence:
    audition:
      enabled: true

      # State progression (volume-based)
      shadow:
        min_sessions: 10
        min_days: 3
        max_failures: 3

      probation:
        min_sessions: 25
        min_days: 7
        max_failures: 5

      evaluation:
        min_sessions: 50
        min_quality_percentile: 0.75

      quarantine:
        cooldown_hours: 24

      # Selection limits
      max_audition_seats: 1
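The nested YAML keys line up with the flat field names of `GraduationCriteria` once each section name is used as a prefix (with `evaluation` shortened to `eval`). A minimal sketch of that mapping follows, assuming the `audition` section has already been parsed into a plain mapping. `criteria_kwargs` is a hypothetical helper, and this ADR does not specify the actual loader.

```python
from typing import Any, Dict, Mapping

# YAML section name -> GraduationCriteria field prefix
_SECTION_PREFIX = {
    "shadow": "shadow",
    "probation": "probation",
    "evaluation": "eval",
    "quarantine": "quarantine",
}


def criteria_kwargs(audition_cfg: Mapping[str, Any]) -> Dict[str, Any]:
    """Flatten the nested `audition` config into GraduationCriteria keyword arguments.

    Keys outside the four state sections (enabled, max_audition_seats) are ignored.
    """
    kwargs: Dict[str, Any] = {}
    for section, prefix in _SECTION_PREFIX.items():
        for key, value in (audition_cfg.get(section) or {}).items():
            kwargs[f"{prefix}_{key}"] = value
    return kwargs
```

The result is intended to be passed as `GraduationCriteria(**criteria_kwargs(cfg))`, so any section omitted from the YAML falls back to the dataclass defaults.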

Environment Variables

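| Variable                             | Type | Default | Purpose                         |
|--------------------------------------|------|---------|---------------------------------|
| LLM_COUNCIL_AUDITION_ENABLED         | bool | true    | Enable audition mechanism       |
| LLM_COUNCIL_AUDITION_MAX_SEATS       | int  | 1       | Max audition models per session |
| LLM_COUNCIL_AUDITION_SHADOW_SESSIONS | int  | 10      | Sessions before probation       |
| LLM_COUNCIL_AUDITION_EVAL_SESSIONS   | int  | 50      | Sessions for full graduation    |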
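A sketch of how these variables might be read with their documented defaults. The helpers `env_bool` and `env_int` are hypothetical, and the set of strings accepted as true is an assumption, since this ADR only lists the variable names, types, and defaults.

```python
import os
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes", "on"}  # assumed spellings, not specified by the ADR


def env_bool(name: str, default: bool, env: Optional[Mapping[str, str]] = None) -> bool:
    """Read a boolean variable; unset means `default`."""
    source = os.environ if env is None else env
    raw = source.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


def env_int(name: str, default: int, env: Optional[Mapping[str, str]] = None) -> int:
    """Read an integer variable; unset or blank means `default`."""
    source = os.environ if env is None else env
    raw = source.get(name)
    if raw is None or not raw.strip():
        return default
    return int(raw)  # raises ValueError on malformed input


# Usage with the defaults from the table above
enabled = env_bool("LLM_COUNCIL_AUDITION_ENABLED", True)
max_seats = env_int("LLM_COUNCIL_AUDITION_MAX_SEATS", 1)
```

Letting `int()` raise on malformed values surfaces misconfiguration at startup instead of silently falling back to a default.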

Observability (Council Requirement)

# Metrics to emit
audition.state_transition{model_id, from_state, to_state}
audition.selection{model_id, state, selected}
audition.failure{model_id, state, failure_type}
audition.quarantine{model_id, reason, cooldown_hours}
audition.graduation{model_id, quality_percentile}

# Structured logging
{
  "event": "audition_state_change",
  "model_id": "openai/gpt-5.2",
  "from_state": "PROBATION",
  "to_state": "EVALUATION",
  "session_count": 25,
  "days_tracked": 8,
  "quality_percentile": null
}
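The structured log record above could be produced with the standard `json` and `logging` modules, as in the sketch below. The logger name and the helper names `state_change_event` and `log_state_change` are assumptions for illustration; only the field names come from the example record.

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("llm_council.audition")  # logger name is an assumption


def state_change_event(
    model_id: str,
    from_state: str,
    to_state: str,
    session_count: int,
    days_tracked: int,
    quality_percentile: Optional[float] = None,
) -> str:
    """Serialize one audition_state_change record as a JSON string."""
    payload = {
        "event": "audition_state_change",
        "model_id": model_id,
        "from_state": from_state,
        "to_state": to_state,
        "session_count": session_count,
        "days_tracked": days_tracked,
        "quality_percentile": quality_percentile,  # None serializes to null
    }
    return json.dumps(payload)


def log_state_change(**fields) -> None:
    """Emit the record at INFO level as a single JSON line."""
    logger.info(state_change_event(**fields))
```

Emitting one JSON object per line keeps the records easy to ingest with common log pipelines.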

Consequences

Positive

  • Solves cold start problem for new models
  • Volume-based graduation ensures actual usage
  • Shadow Mode protects consensus from unproven models
  • Explicit state machine provides clear lifecycle
  • Quality gate prevents low-quality model graduation

Negative

  • Audition models contribute responses but not votes
  • Longer path to full status (50 sessions minimum)
  • Additional complexity in selection logic
  • Quarantine cannot distinguish transient failures from systemic ones

Risks & Mitigations

| Risk                                | Mitigation                              |
|-------------------------------------|-----------------------------------------|
| Audition model pollutes context     | Shadow Mode (advisory only)             |
| Volume manipulation (spam requests) | Min days requirement alongside sessions |
| Good model stuck in evaluation      | Clear quality percentile metric         |
| Quarantine too aggressive           | Configurable thresholds, auto-retry     |
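The "clear quality percentile metric" mitigation depends on how the percentile is defined, which this ADR leaves open. One plausible definition, shown here purely as an assumption, is the fraction of peer models whose quality score is at or below the candidate's score. The function name and the choice of peer set are hypothetical.

```python
from typing import Optional, Sequence


def quality_percentile(score: float, peer_scores: Sequence[float]) -> Optional[float]:
    """Fraction of peers scoring at or below `score`; None when there are no peers.

    Assumed definition: the ADR does not specify how the percentile is computed.
    """
    if not peer_scores:
        return None
    at_or_below = sum(1 for s in peer_scores if s <= score)
    return at_or_below / len(peer_scores)


# Usage: with four peers, a score of 0.8 sits exactly on the 0.75 graduation gate.
print(quality_percentile(0.8, [0.6, 0.7, 0.8, 0.9]))  # 0.75
```

Returning `None` for an empty peer set matches the `quality_percentile is not None` guard in the EVALUATION → FULL check, so a model cannot graduate before a comparison is possible.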

Testing Strategy

class TestAuditionMechanism:
    def test_state_transitions_volume_based(self):
        """State transitions require session count, not just time."""

    def test_shadow_mode_no_voting(self):
        """Shadow state models have ADVISORY voting authority."""

    def test_quarantine_on_failures(self):
        """Consecutive failures trigger quarantine."""

    def test_quarantine_cooldown_and_retry(self):
        """Models retry from SHADOW after cooldown."""

    def test_max_audition_seats(self):
        """Only 1 audition model selected per session."""

    def test_quality_gate_for_full(self):
        """EVALUATION → FULL requires 75th percentile quality."""

    def test_selection_weight_progression(self):
        """Weight scales 0.3 → 1.0 during EVALUATION."""

Implementation Plan

  1. Define AuditionState enum and AuditionStatus dataclass
  2. Implement state transition logic with volume-based criteria
  3. Add AuditionTracker with persistence (JSONL)
  4. Integrate with VotingAuthority from ADR-027
  5. Implement select_with_audition() with progressive weights
  6. Add metrics and structured logging
  7. Add configuration via YAML and env vars
  8. Add comprehensive tests
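Step 3 calls for JSONL persistence in `AuditionTracker`. The sketch below shows one way an append-only JSONL store could back `get_status()`, assuming one JSON object per line keyed by `model_id`, with the most recent record winning on reload. `JsonlStatusStore` and the record shape are hypothetical, and the actual tracker design is not specified here.

```python
import json
from pathlib import Path
from typing import Dict, Optional


class JsonlStatusStore:
    """Append-only JSONL store; the last record per model_id wins on load."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self._latest: Dict[str, dict] = {}
        if path.exists():
            with path.open("r", encoding="utf-8") as fh:
                for line in fh:
                    line = line.strip()
                    if not line:
                        continue  # tolerate blank lines
                    record = json.loads(line)
                    self._latest[record["model_id"]] = record

    def save(self, record: dict) -> None:
        """Append one status snapshot and update the in-memory view."""
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
        self._latest[record["model_id"]] = record

    def get_status(self, model_id: str) -> Optional[dict]:
        """Latest snapshot for the model, or None if it was never recorded."""
        return self._latest.get(model_id)
```

Appending a full snapshot on each change keeps writes simple and preserves a history of state changes, at the cost of a file that grows until it is compacted.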

References