ADR-041: Verification Telemetry Wiring
Status: Accepted
Date: 2026-03-07
Decision Makers: Chris Joseph, LLM Council
Related: ADR-026 (Phase 3), ADR-040 (Phase 2 item 4)
Context
Two existing ADRs promise telemetry capabilities that are not functioning in production:
ADR-026 Phase 3: Dead Code
ADR-026 Phase 3 is marked IMPLEMENTED and delivered a complete `performance/` module (~700 lines, 70 tests) with:
- `ModelSessionMetric` — per-model, per-session latency, Borda score, parse success
- `InternalPerformanceTracker` — rolling-window aggregation with exponential decay
- `ModelPerformanceIndex` — p50/p95 latency, mean Borda, confidence levels
- `persist_session_performance_data()` — integration entry point
- JSONL storage following the `bias_persistence.py` pattern
The problem: `persist_session_performance_data()` is never called from any production code path; it is only called from tests. The performance store at `~/.llm-council/performance_metrics.jsonl` does not exist on any installation. The entire Phase 3 module is dead code.
This means:
- `get_quality_score()` always returns 50.0 (cold start) for every model
- `select_tier_models()` cannot use internal performance data for selection
- The ADR-029 audition mechanism's quality percentile graduation gate (>= 75th percentile) can never fire
- ADR-026's validation gate ("100+ sessions tracked") has never been met
ADR-040 Phase 2 Item 4: Missing Observability
ADR-040 Phase 2 explicitly requires:
"Add observability metrics (stage durations, timeout frequency, estimated vs actual duration)"
With success criteria:
"P95 verification latency for high tier < 270s (4.5 min)"
And resolved question Q4:
"Add instrumentation to track `char_count -> actual_duration` for data-driven refinement"
The problem: The verification pipeline (`verification/api.py`) uses `time.monotonic()` internally for waterfall budgeting but discards all timing data. No per-stage duration is recorded in transcripts. No total elapsed time appears in `result.json`. The success criteria cannot be measured.
Additionally, ADR-040's deferred options depend on this data:
- Option E (Tiered Stage 2 Optimization): "Deferred until Options A+D provide observability data"
- Option F (Early Consensus Termination): Requires stage duration data to tune thresholds
Impact: High-Tier Timeouts Cannot Be Tuned
High-tier verifications frequently hit the global timeout (observed across 26 local transcript sessions, where timeout_fired and partial results indicate timeout-related failures). Without timing telemetry, we cannot:
- Determine which stage is the bottleneck (Stage 1 vs Stage 2 vs Stage 3)
- Identify which models are consistently slow
- Validate whether the 1.5x timeout multiplier is appropriate
- Measure whether waterfall budget ratios (50%/70%/remaining) are well-calibrated
- Compare `char_count` to actual duration for input size limit refinement
- Make data-driven decisions about ADR-040 Option E or F
Current Data Available
The only data source is verification transcripts (.council/logs/), which contain:
- `request.json` — tier, paths, rubric_focus (no timing)
- `stage1.json` — model responses (no per-model latency)
- `stage2.json` — rankings (no per-reviewer latency)
- `stage3.json` — synthesis (no duration)
- `result.json` — verdict, confidence, `timeout_fired` (no elapsed time)
26 transcripts exist locally but none contain timing data.
Decision
Wire the existing telemetry infrastructure into the verification pipeline. This is primarily an integration task — the storage, aggregation, and analysis code already exists.
Phase 1: Transcript Timing (Verification Pipeline)
Record wall-clock timing in verification transcripts for immediate observability.
1.1 Per-stage timing in _run_verification_pipeline()
Initialize stage_timings at pipeline entry and record per-stage elapsed time. Timing must be captured even on timeout or error — this is the most valuable data for tuning.
```python
# Initialize at pipeline entry
partial_state["stage_timings"] = {}
pipeline_start = time.monotonic()

# Per-stage pattern (repeated for each stage)
stage1_start = time.monotonic()
try:
    # ... stage 1 execution ...
    pass
finally:
    # Always capture timing, even on exception/cancellation
    partial_state["stage_timings"]["stage1"] = int((time.monotonic() - stage1_start) * 1000)
```
On global timeout (asyncio.CancelledError from wait_for), partial_state["stage_timings"] contains whatever stages completed plus the in-progress stage's partial timing. The finally block ensures the current stage's elapsed time is recorded before cancellation propagates.
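The repeated per-stage pattern above can be factored into a small helper. A minimal sketch, assuming a `stage_timer` context manager (hypothetical, not an existing function in the codebase):

```python
import time
from contextlib import contextmanager

@contextmanager
def stage_timer(timings: dict, stage: str):
    """Record a stage's wall-clock elapsed time in milliseconds.

    The finally clause runs even on exception or task cancellation,
    so the in-progress stage's partial timing is always captured.
    """
    start = time.monotonic()
    try:
        yield
    finally:
        timings[stage] = int((time.monotonic() - start) * 1000)

# Usage at each stage boundary (illustrative):
partial_state = {"stage_timings": {}}
with stage_timer(partial_state["stage_timings"], "stage1"):
    pass  # ... stage 1 execution ...
```

This keeps the cancellation-safety guarantee in one place instead of repeating the try/finally at every stage.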
1.2 Per-model latency in Stage 1 and Stage 2
Stage 1 already returns model_statuses with latency_ms from the gateway. Persist this in stage1.json.
Stage 2's asyncio.as_completed path (ADR-040) records per-reviewer timing by wrapping each task with a start timestamp and computing delta on completion:
```python
# In the Stage 2 as_completed path
task_start_times: Dict[str, float] = {}
for model in reviewers:
    task_start_times[model] = time.monotonic()
    # ... create task ...

# On completion of each reviewer task
elapsed = int((time.monotonic() - task_start_times[model]) * 1000)
reviewer_timings[model] = elapsed
```
1.3 Summary timing in result.json
Add to the transcript result. `budget_utilization` is defined as `total_elapsed_ms / global_deadline_ms`, the ratio of time used to time available; it stays in [0, 1] for non-timeout cases and can exceed 1.0 if the timeout fires mid-cleanup:
```json
{
  "timing": {
    "total_elapsed_ms": 142000,
    "stage1_elapsed_ms": 45000,
    "stage2_elapsed_ms": 78000,
    "stage3_elapsed_ms": 19000,
    "global_deadline_ms": 270000,
    "timeout_fired": false,
    "budget_utilization": 0.53
  },
  "input_metrics": {
    "content_chars": 32000,
    "tier_max_chars": 50000,
    "num_models": 4,
    "num_reviewers": 4
  }
}
```
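A helper along these lines could assemble the summary object (a sketch; `build_timing_summary` is illustrative, not an existing function):

```python
def build_timing_summary(stage_timings: dict, total_elapsed_ms: int,
                         global_deadline_ms: int, timeout_fired: bool) -> dict:
    """Assemble the timing block for result.json.

    Stages that never started are simply absent from stage_timings,
    so no null-valued keys are emitted.
    """
    summary = {"total_elapsed_ms": total_elapsed_ms}
    for stage, elapsed_ms in stage_timings.items():
        summary[f"{stage}_elapsed_ms"] = elapsed_ms
    summary["global_deadline_ms"] = global_deadline_ms
    summary["timeout_fired"] = timeout_fired
    summary["budget_utilization"] = round(total_elapsed_ms / global_deadline_ms, 2)
    return summary

summary = build_timing_summary(
    {"stage1": 45000, "stage2": 78000, "stage3": 19000},
    total_elapsed_ms=142000, global_deadline_ms=270000, timeout_fired=False,
)
# summary["budget_utilization"] -> 0.53
```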
On timeout, partially-completed stages appear with their elapsed time. Missing stages are omitted (not null), so consumers check key existence:
```json
{
  "timing": {
    "total_elapsed_ms": 270000,
    "stage1_elapsed_ms": 45000,
    "stage2_elapsed_ms": 225000,
    "global_deadline_ms": 270000,
    "timeout_fired": true,
    "budget_utilization": 1.0
  }
}
```
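A consumer-side access pattern, for illustration (the `stage_elapsed` helper is hypothetical):

```python
def stage_elapsed(result: dict, stage: str):
    """Return a stage's elapsed ms, or None if the stage never ran.

    Missing stages are omitted rather than written as null, so
    consumers must check key existence instead of assuming presence.
    """
    return result.get("timing", {}).get(f"{stage}_elapsed_ms")

timeout_result = {
    "timing": {
        "total_elapsed_ms": 270000,
        "stage1_elapsed_ms": 45000,
        "stage2_elapsed_ms": 225000,
        "global_deadline_ms": 270000,
        "timeout_fired": True,
        "budget_utilization": 1.0,
    }
}
stage_elapsed(timeout_result, "stage2")  # 225000
stage_elapsed(timeout_result, "stage3")  # None: stage 3 never started
```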
Phase 2: Performance Tracker Integration
Wire persist_session_performance_data() into the verification pipeline so the ADR-026 Phase 3 tracker actually accumulates data.
2.1 Call site in run_verification()
The existing function signature (from performance/integration.py) is:
```python
def persist_session_performance_data(
    session_id: str,
    model_statuses: Dict[str, Dict[str, Any]],      # model_id -> {"latency_ms": int, ...}
    aggregate_rankings: Dict[str, Dict[str, Any]],  # model_id -> {"borda_score": float, ...}
    stage2_results: Optional[List[Dict[str, Any]]] = None,
) -> int:
    ...
```
The verification pipeline produces these exact shapes:
- `model_statuses` from `stage1_collect_responses_with_status()` — confirmed compatible
- `aggregate_rankings` from `calculate_aggregate_rankings()` — confirmed compatible
- `stage2_results` from `stage2_collect_rankings()` — confirmed compatible
Call in a finally block after pipeline completion, wrapped in try/except to ensure telemetry failures never fail the verification:
```python
# In run_verification(), after the pipeline completes (or times out)
try:
    persist_session_performance_data(
        session_id=verification_id,
        model_statuses=partial_state.get("model_statuses", {}),
        aggregate_rankings=partial_state.get("aggregate_rankings", {}),
        stage2_results=partial_state.get("stage2_results"),
    )
except Exception:
    logger.warning("Telemetry persistence failed", exc_info=True)
```
Reliability constraint: Telemetry persistence must never fail the verification pipeline. A high-tier verification taking minutes should never crash at the last step because a JSONL write failed (disk full, permissions, etc.).
2.2 Call site in consult_council MCP tool (Deferred)
The main council deliberation path in mcp_server.py also never calls the tracker. This is a separate integration with different data shapes and session semantics. Deferred to a follow-up task to keep this ADR focused on the verification pipeline.
Concurrency note: JSONL append writes are atomic on POSIX for lines under PIPE_BUF (4096 bytes on Linux/macOS). ModelSessionMetric.to_jsonl_line() produces ~200 bytes per record, well under this limit. The existing bias_persistence.py uses the same pattern in production. No file locking is required.
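For illustration, the append pattern looks like the following sketch (assumed to mirror the existing `bias_persistence.py` approach; `append_jsonl` itself is hypothetical):

```python
import json

def append_jsonl(path: str, record: dict) -> None:
    """Append one record as a single newline-terminated JSON line.

    Python's "a" mode opens with O_APPEND, so on POSIX the write is
    atomic as long as the line stays under PIPE_BUF (4096 bytes on
    Linux/macOS); no file locking is needed across concurrent sessions.
    """
    line = json.dumps(record, separators=(",", ":")) + "\n"
    if len(line.encode("utf-8")) >= 4096:
        raise ValueError("record too large for atomic append")
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)
```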
Phase 3: Analysis CLI (Deferrable)
Extend the existing bias-report CLI pattern to add a timing-report command. This phase is a convenience tool and is explicitly deferrable — raw JSONL and transcript data can be analyzed with jq in the interim.
```
llm-council timing-report [--days N] [--tier TIER] [--format text|json]
```
Output:
- Per-tier P50/P95/P99 total elapsed time
- Per-stage P50/P95 breakdown
- Per-model P50/P95 latency
- Timeout frequency by tier
- `char_count` vs `actual_duration` correlation
- Budget utilization distribution
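Until the CLI exists, the percentiles can be pulled straight from transcripts. A minimal interim sketch in Python (the per-session layout `.council/logs/*/result.json` is an assumption about directory structure):

```python
import glob
import json

def p95(values: list) -> int:
    """Nearest-rank P95 over a non-empty list of millisecond durations."""
    ordered = sorted(values)
    return ordered[max(0, round(0.95 * len(ordered)) - 1)]

def collect_total_elapsed(pattern: str = ".council/logs/*/result.json") -> list:
    """Gather total_elapsed_ms from transcripts that carry timing data.

    Older transcripts without a timing object are skipped gracefully.
    """
    durations = []
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as f:
            timing = json.load(f).get("timing", {})
        if "total_elapsed_ms" in timing:
            durations.append(timing["total_elapsed_ms"])
    return durations
```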
What This Enables
Once data accumulates (target: 30+ sessions for PRELIMINARY confidence):
- Timeout tuning: Adjust `VERIFICATION_TIMEOUT_MULTIPLIER` based on P95 data
- Budget ratio tuning: Adjust the 50%/70%/remaining waterfall based on actual stage proportions
- Input limit refinement: Adjust `TIER_MAX_CHARS` based on the `char_count -> duration` correlation
- Model selection feedback: `select_tier_models()` uses real latency/quality data
- ADR-040 Option E: Data-driven Stage 2 optimization decisions
- ADR-029 graduation: Audition quality percentile gate becomes functional
ADR Updates Required
- ADR-026 Phase 3: Add "Known Gaps" subsection noting that wiring is incomplete, pointing to ADR-041
- ADR-040 Phase 2: Add implementation status for item 4 pointing to ADR-041
Consequences
Positive
- Existing investment pays off: ~700 lines of performance tracking code becomes functional
- Data-driven timeout tuning: Replace guesswork with measured P95 latencies
- ADR-040 success criteria measurable: Can finally validate "P95 < 270s for high tier"
- ADR-029 audition unblocked: Quality percentile graduation gate becomes functional
- Low risk: No new abstractions — wiring existing code to existing call sites
Negative
- Disk usage: JSONL files grow over time (~200 bytes per model per session). Data retention/rotation is deferred — at current usage rates (~30 sessions/month), annual growth is ~300KB, which does not warrant a rotation strategy yet.
- Minor latency: File I/O for persistence adds ~1-2ms per session (negligible vs 60-270s sessions)
- Migration: Old transcripts lack timing data; analysis tools must handle a missing `timing` field gracefully
- Test impact: Integration tests that run real verification will create JSONL files; test fixtures should use temp directories with cleanup
Neutral
- Additive VerifyResponse schema: Optional `timing` and `input_metrics` fields added to `VerifyResponse` (backward compatible — both default to `None`)
- Backward compatible: transcript `result.json` gains a new `timing` field (additive)
Compliance / Validation
- Transcript timing test: Verify `result.json` contains a `timing` object with `total_elapsed_ms`, `global_deadline_ms`, `budget_utilization`, and at least one `stageN_elapsed_ms` field
- Timeout timing test: Verify that on global timeout, `result.json` still contains partial `timing` data for completed and in-progress stages
- Performance store test: Verify `~/.llm-council/performance_metrics.jsonl` is created and populated after a verification session
- Telemetry isolation test: Verify that a `persist_session_performance_data()` failure (simulated via a mock raising IOError) does not affect the verification result
- Integration test: Run 5 verification sessions; verify timing values are non-negative, stage times sum to approximately `total_elapsed_ms` (within 100ms measurement tolerance), and `budget_utilization` is in [0, 1] for non-timeout cases
- ADR-040 validation (runtime, not code gate): After 30+ sessions, measure P95 high-tier latency against the 270s target
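The telemetry isolation test can be sketched with a mock raising IOError; `finish_verification` below is a hypothetical stand-in for the real call site, not production code:

```python
from unittest import mock

def finish_verification(persist) -> dict:
    """Stand-in for the end of run_verification(): the verdict must
    survive a persistence failure (the try/except mirrors section 2.1)."""
    result = {"verdict": "pass", "confidence": "high"}
    try:
        persist(session_id="v-001")
    except Exception:
        pass  # production logs a warning here; the error never propagates
    return result

failing = mock.Mock(side_effect=IOError("disk full"))
assert finish_verification(failing) == {"verdict": "pass", "confidence": "high"}
failing.assert_called_once()
```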
Council Deliberation (2026-03-07, Reasoning Tier)
Models consulted: gemini-3.1-pro-preview, claude-opus-4.6, deepseek-v3.2-speciale
Consensus level: Moderate (2 approved, 1 needs-review)
Council feedback incorporated:
- Error/timeout path timing — Added `finally` blocks for stage timing capture, an explicit timeout transcript example, and a reliability constraint for persistence (all 3 reviewers)
- Function signature verification — Added the explicit signature from `performance/integration.py` with compatibility confirmation (Opus, DeepSeek)
- `partial_state["stage_timings"]` initialization — Added explicit initialization at pipeline entry (Gemini, DeepSeek)
- Phase 2.2 scoping — Deferred `consult_council` wiring to a follow-up task (Opus)
- `budget_utilization` formula — Defined as `total_elapsed_ms / global_deadline_ms` with edge-case behavior (Opus)
- Concurrent write safety — Added POSIX atomic-append justification with `PIPE_BUF` analysis (Opus)
- Phase 3 deferrable — Explicitly marked as deferrable with a `jq` interim workaround (Opus)
- Telemetry isolation — Added a try/except wrapper and a compliance test for persistence failure (Gemini, Opus)
- Data retention — Added an explicit deferral with a growth estimate (Opus)
- Test impact — Added a note about temp directory fixtures (Opus)
- Claim substantiation — Reworded "users report" to reference observed transcript data (Opus)