
ADR-015: Bias Auditing and Length Correlation Tracking

Status: Accepted → Implemented (2025-12-17)
Date: 2025-12-13
Decision Makers: Engineering
Related: ADR-010 (Consensus Mechanisms), ADR-014 (Verbosity Penalty)


Context

ADR-010 recommended tracking "correlation between scores and response length" as part of bias auditing. This provides empirical data to:

  1. Measure the effectiveness of ADR-014 (verbosity penalty prompts)
  2. Detect systematic biases in reviewer models
  3. Identify reviewers that consistently favor certain patterns

The Problem

Without metrics, we cannot:

  • Know if verbosity bias exists (or how severe it is)
  • Measure if our mitigations (ADR-014) are working
  • Detect other systematic biases (e.g., style preferences, position bias)
  • Compare reviewer reliability across models

What to Measure

| Metric | Description | Expected Healthy Range |
|---|---|---|
| Length-Score Correlation | Pearson correlation between word count and score | -0.2 to 0.2 (weak) |
| Position Bias | Score variation by presentation order (A, B, C, ...) | < 5% difference |
| Reviewer Calibration | Mean and variance of scores per reviewer | Similar across reviewers |
| Self-Vote Inflation | Score given to own response vs. others | Should be excluded |

Decision

Implement a per-session bias auditing module that calculates and reports correlation metrics for each council session.

Scope and Limitations

This ADR implements per-session bias indicators only. With typical council sizes of 4-5 models:

| Metric | Data Points | Minimum for Significance | Status |
|---|---|---|---|
| Length correlation | 4-5 pairs | 30+ pairs | Indicator only |
| Position bias | 1 ordering | 20+ orderings | Indicator only |
| Reviewer calibration | N*(N-1) scores | 50+ scores/reviewer | Indicator only |

These metrics can detect extreme single-session anomalies (e.g., r > 0.9) but should not be interpreted as statistically robust proof of systematic bias. They are diagnostic indicators, not statistical evidence.
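
As a small illustration (not part of the proposed module), the p-value for an observed correlation at council-sized N can be computed directly; even a strong correlation is inconclusive with only four or five responses:

# Illustration only: significance of a Pearson r at small N (assumes scipy is available).
import numpy as np
from scipy import stats

def pearson_p_value(r: float, n: int) -> float:
    """Two-tailed p-value for an observed Pearson r over n paired observations."""
    df = n - 2
    t = r * np.sqrt(df / (1.0 - r ** 2))
    return 2 * stats.t.sf(abs(t), df)

print(round(pearson_p_value(0.8, 5), 3))   # ~0.10 -- not significant at the 0.05 level
print(round(pearson_p_value(0.9, 5), 3))   # ~0.04 -- only extreme r values register at n=5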

Proposed Implementation

# bias_audit.py
import numpy as np
from dataclasses import dataclass
from typing import Dict, List, Optional
from scipy import stats

@dataclass
class BiasAuditResult:
    """Results from bias analysis of a council session."""

    # Length bias
    length_score_correlation: float  # Pearson r
    length_score_p_value: float  # Statistical significance
    length_bias_detected: bool  # |r| > 0.3 and p < 0.05

    # Position bias (if randomization data available)
    position_score_variance: Optional[float]
    position_bias_detected: Optional[bool]

    # Reviewer calibration
    reviewer_mean_scores: Dict[str, float]
    reviewer_score_variance: Dict[str, float]
    harsh_reviewers: List[str]  # Mean score < median - 1 std
    generous_reviewers: List[str]  # Mean score > median + 1 std

    # Summary
    overall_bias_risk: str  # "low", "medium", "high"


def calculate_length_correlation(
    responses: List[Dict],
    scores: Dict[str, Dict[str, float]]
) -> tuple[float, float]:
    """
    Calculate Pearson correlation between response length and average score.

    Args:
        responses: List of {model, response} dicts
        scores: {reviewer: {candidate: score}} nested dict

    Returns:
        (correlation coefficient, p-value)
    """
    # Calculate word counts
    word_counts = {r['model']: len(r['response'].split()) for r in responses}

    # Calculate average score per response
    avg_scores = {}
    for candidate in word_counts.keys():
        candidate_scores = [
            s[candidate] for s in scores.values()
            if candidate in s
        ]
        if candidate_scores:
            avg_scores[candidate] = np.mean(candidate_scores)

    # Align data
    models = list(avg_scores.keys())
    lengths = [word_counts[m] for m in models]
    score_values = [avg_scores[m] for m in models]

    if len(models) < 3:
        return 0.0, 1.0  # Insufficient data

    r, p = stats.pearsonr(lengths, score_values)
    return r, p


def audit_reviewer_calibration(
    scores: Dict[str, Dict[str, float]]
) -> Dict[str, Dict[str, float]]:
    """
    Analyze score calibration across reviewers.

    Returns:
        {reviewer: {mean, std, count}}
    """
    calibration = {}
    for reviewer, reviewer_scores in scores.items():
        values = list(reviewer_scores.values())
        if values:
            calibration[reviewer] = {
                "mean": np.mean(values),
                "std": np.std(values),
                "count": len(values)
            }
    return calibration


def run_bias_audit(
    stage1_responses: List[Dict],
    stage2_scores: Dict[str, Dict[str, float]],
    position_mapping: Optional[Dict[str, int]] = None
) -> BiasAuditResult:
    """
    Run full bias audit on a council session.

    Args:
        stage1_responses: List of {model, response} from Stage 1
        stage2_scores: {reviewer: {candidate: score}} from Stage 2
        position_mapping: Optional {candidate: position_shown} for position bias

    Returns:
        BiasAuditResult with all metrics
    """
    # Length correlation
    r, p = calculate_length_correlation(stage1_responses, stage2_scores)
    length_bias = abs(r) > 0.3 and p < 0.05

    # Reviewer calibration
    calibration = audit_reviewer_calibration(stage2_scores)
    means = [c["mean"] for c in calibration.values()]
    median_mean = np.median(means) if means else 5.0
    std_mean = np.std(means) if len(means) > 1 else 1.0

    harsh = [rev for rev, c in calibration.items() if c["mean"] < median_mean - std_mean]
    generous = [rev for rev, c in calibration.items() if c["mean"] > median_mean + std_mean]

    # Position bias (if data available)
    position_variance = None
    position_bias = None
    if position_mapping:
        # Group scores by position
        position_scores = {}
        for reviewer, reviewer_scores in stage2_scores.items():
            for candidate, score in reviewer_scores.items():
                pos = position_mapping.get(candidate)
                if pos is not None:
                    position_scores.setdefault(pos, []).append(score)

        if position_scores:
            position_means = [np.mean(s) for s in position_scores.values()]
            position_variance = np.var(position_means)
            position_bias = position_variance > 0.5  # Threshold TBD

    # Overall risk assessment
    risk_factors = sum([
        length_bias,
        position_bias or False,
        len(harsh) > 0,
        len(generous) > 0
    ])
    overall_risk = "low" if risk_factors == 0 else "medium" if risk_factors <= 2 else "high"

    return BiasAuditResult(
        length_score_correlation=round(r, 3),
        length_score_p_value=round(p, 4),
        length_bias_detected=length_bias,
        position_score_variance=round(position_variance, 3) if position_variance is not None else None,
        position_bias_detected=position_bias,
        reviewer_mean_scores={rev: round(c["mean"], 2) for rev, c in calibration.items()},
        reviewer_score_variance={rev: round(c["std"], 2) for rev, c in calibration.items()},
        harsh_reviewers=harsh,
        generous_reviewers=generous,
        overall_bias_risk=overall_risk
    )
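
For reference, a minimal usage sketch of run_bias_audit with the input shapes described in the docstrings above; the model names and scores are illustrative, and self-votes are omitted as the "What to Measure" table recommends:

stage1_results = [
    {"model": "openai/gpt-4", "response": "A short answer."},
    {"model": "anthropic/claude-3", "response": "A noticeably longer answer with much more detail and explanation."},
    {"model": "google/gemini-pro", "response": "A medium-length answer."},
]
stage2_scores = {
    "openai/gpt-4":       {"anthropic/claude-3": 7.5, "google/gemini-pro": 6.0},
    "anthropic/claude-3":  {"openai/gpt-4": 8.0, "google/gemini-pro": 7.0},
    "google/gemini-pro":   {"openai/gpt-4": 7.0, "anthropic/claude-3": 8.5},
}

result = run_bias_audit(stage1_results, stage2_scores)
print(result.length_score_correlation, result.overall_bias_risk)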

Integration with Council Pipeline

# council.py additions
from dataclasses import asdict

from llm_council.bias_audit import run_bias_audit

async def run_full_council(user_query: str, ...):
    # ... existing Stage 1, 2, 3 logic ...

    # Optional: Run bias audit
    if BIAS_AUDIT_ENABLED:
        audit_result = run_bias_audit(
            stage1_results,
            extract_scores_from_stage2(stage2_results),
            position_mapping=label_to_position  # Track original position
        )
        metadata["bias_audit"] = asdict(audit_result)

    return stage1, stage2, stage3, metadata

Configuration

# config.py
DEFAULT_BIAS_AUDIT_ENABLED = False # Off by default (adds latency)
BIAS_AUDIT_ENABLED = os.getenv("LLM_COUNCIL_BIAS_AUDIT", "false").lower() == "true"

# Thresholds
LENGTH_CORRELATION_THRESHOLD = 0.3 # |r| above this = bias detected
POSITION_VARIANCE_THRESHOLD = 0.5 # Score variance by position
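
If the thresholds should also be overridable at runtime, as the environment variable names in the implementation notes below suggest, one possible shape (a sketch, not the shipped config.py) is:

# Hypothetical: read thresholds from the environment, falling back to the documented defaults
LENGTH_CORRELATION_THRESHOLD = float(
    os.getenv("LLM_COUNCIL_LENGTH_CORRELATION_THRESHOLD", "0.3")
)
POSITION_VARIANCE_THRESHOLD = float(
    os.getenv("LLM_COUNCIL_POSITION_VARIANCE_THRESHOLD", "0.5")
)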

Output Example

{
  "bias_audit": {
    "length_score_correlation": 0.42,
    "length_score_p_value": 0.023,
    "length_bias_detected": true,
    "position_score_variance": 0.12,
    "position_bias_detected": false,
    "reviewer_mean_scores": {
      "openai/gpt-4": 6.2,
      "anthropic/claude": 7.8,
      "google/gemini": 7.1
    },
    "harsh_reviewers": ["openai/gpt-4"],
    "generous_reviewers": [],
    "overall_bias_risk": "medium"
  }
}

Alternatives Considered

Alternative 1: External Analytics Only

Log raw data and analyze offline.

Partially Adopted: We should log raw data for deeper analysis, but real-time metrics are valuable for immediate feedback.

Alternative 2: Automatic Score Adjustment

Automatically adjust scores based on detected bias.

Rejected: Too aggressive. Better to report bias and let users decide how to respond.

Alternative 3: Reviewer Weighting

Weight reviewers by historical calibration accuracy.

Deferred: Requires historical data collection first. Can be added later based on audit results.


Risks and Mitigations

| Risk | Mitigation |
|---|---|
| Adds latency to pipeline | Make audit optional; calculate asynchronously |
| False positives with small N | Require minimum sample size; show confidence |
| Overcomplicates output | Put audit in metadata, not main response |
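
One way to keep the audit off the request's critical path (a sketch, assuming the audit itself stays synchronous and CPU-light) is to run it via asyncio.to_thread so the event loop is never blocked; the helper name below is hypothetical:

import asyncio
from dataclasses import asdict

async def attach_bias_audit(metadata, stage1_results, stage2_scores, position_mapping=None):
    """Run the synchronous bias audit off the event loop and stash the result in metadata."""
    result = await asyncio.to_thread(
        run_bias_audit, stage1_results, stage2_scores, position_mapping=position_mapping
    )
    metadata["bias_audit"] = asdict(result)

The pipeline could await this alongside Stage 3 (for example via asyncio.gather) so the audit cost overlaps synthesis latency rather than adding to it.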

Questions for Council Review

  1. Are the proposed metrics comprehensive? What's missing?
  2. What thresholds should trigger bias warnings?
  3. Should bias audit results affect the synthesis or just be logged?
  4. Should we track reviewer reliability over time (requires persistence)?

Council Review Feedback

Reviewed: 2025-12-17 (GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5, Grok 4)

Verdict: Approved with Enhancements

The council unanimously approved ADR-015's approach to bias auditing, with recommendations to expand the bias taxonomy.

Approved Elements

| Element | Council Assessment |
|---|---|
| Length-score correlation | Sound methodology |
| Reviewer calibration | Z-normalization is industry standard |
| Per-session logging | Correct starting point before persistence |
| Non-invasive design | Important: don't auto-adjust scores |

Additional Biases to Track

The council identified biases not covered in the original proposal:

| Bias Type | Description | Detection Method |
|---|---|---|
| Self-Consistency | Same model gives different scores to identical content across sessions | Run duplicate queries, measure score variance |
| Anchoring Bias | First response seen anchors expectations for subsequent responses | Track correlation between position and deviation from mean |
| Style Mimicry | Reviewers prefer responses similar to their own output style | Embed responses, measure similarity to reviewer's typical style |
| Sycophancy Detection | Models rating competitors harshly while being lenient on "friendly" models | Track reviewer→candidate score patterns |
| Reasoning Chain Bias | Longer reasoning ≠ better reasoning but may be rated higher | Track reasoning length vs. answer correctness |
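
As an illustration of the reviewer→candidate pattern tracking suggested for sycophancy detection, a hedged sketch that aggregates, across sessions, how far each reviewer's score for a candidate sits from that candidate's cross-reviewer average; the helper and its input shape are hypothetical:

from collections import defaultdict
from statistics import mean

def reviewer_candidate_deviation(sessions):
    """sessions: list of {reviewer: {candidate: score}} dicts, one per council run.
    Returns {(reviewer, candidate): mean deviation from the candidate's average score}.
    A persistently large positive value for one pair may hint at favoritism."""
    deviations = defaultdict(list)
    for scores in sessions:
        per_candidate = defaultdict(list)
        for reviewer_scores in scores.values():
            for candidate, s in reviewer_scores.items():
                per_candidate[candidate].append(s)
        averages = {c: mean(v) for c, v in per_candidate.items()}
        for reviewer, reviewer_scores in scores.items():
            for candidate, s in reviewer_scores.items():
                deviations[(reviewer, candidate)].append(s - averages[candidate])
    return {pair: mean(d) for pair, d in deviations.items()}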

Implementation Recommendations

  1. Phase 1: Implement core metrics (length, position, calibration) per session
  2. Phase 2: Add self-consistency testing (periodic duplicate queries)
  3. Phase 3: Implement cross-session persistence for reviewer profiling
  4. Phase 4: Build bias dashboard for visualization

Statistical Considerations

"With N=4 models, statistical power is very low. Correlation coefficients should be interpreted with extreme caution. Consider bootstrapping confidence intervals rather than p-values."

Minimum sample sizes for reliable detection:

  • Length correlation: N ≥ 10 responses per session
  • Position bias: N ≥ 20 sessions with position tracking
  • Reviewer calibration: N ≥ 50 reviews per reviewer
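
Following the council's suggestion to prefer bootstrapped confidence intervals over p-values at small N, a pure-Python sketch (the helper is illustrative, not part of the module); with 4-5 pairs the resulting interval is typically very wide, which makes the uncertainty explicit:

import random
from statistics import mean

def bootstrap_correlation_ci(lengths, scores, n_boot=2000, alpha=0.05):
    """Percentile-bootstrap confidence interval for the length-score Pearson r."""
    def pearson(xs, ys):
        mx, my = mean(xs), mean(ys)
        num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
        return num / den if den else 0.0  # degenerate resamples count as zero correlation
    pairs = list(zip(lengths, scores))
    rs = sorted(
        pearson(*zip(*[random.choice(pairs) for _ in pairs]))
        for _ in range(n_boot)
    )
    return rs[int((alpha / 2) * n_boot)], rs[int((1 - alpha / 2) * n_boot)]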

Dependencies

| ADR | Dependency Type |
|---|---|
| ADR-017 (Position Tracking) | Implemented inline - position data derived from label_to_model via derive_position_mapping() |
| ADR-016 (Rubric Scoring) | Enhances - per-dimension bias analysis possible with rubric |
| ADR-014 (Verbosity Penalty) | Superseded - bias auditing replaces manual penalty |

Priority: HIGH - Foundation for data-driven evaluation improvements. Implemented 2025-12-17, following ADR-016/017.


Implementation Status (2025-12-17)

Completed Components

| Component | File | Tests |
|---|---|---|
| BiasAuditResult dataclass | src/llm_council/bias_audit.py | 2 tests |
| calculate_length_correlation() | src/llm_council/bias_audit.py | 6 tests |
| audit_reviewer_calibration() | src/llm_council/bias_audit.py | 4 tests |
| calculate_position_bias() | src/llm_council/bias_audit.py | 4 tests |
| derive_position_mapping() | src/llm_council/bias_audit.py | 8 tests |
| extract_scores_from_stage2() | src/llm_council/bias_audit.py | 4 tests |
| run_bias_audit() | src/llm_council/bias_audit.py | 5 tests |
| Pipeline integration | src/llm_council/council.py | - |
| Configuration | src/llm_council/config.py | - |

Key Implementation Details

  1. Pure Python - No scipy/numpy dependency; uses statistics module
  2. Position Mapping - Derived from label_to_model (Response A → 0, Response B → 1, etc.)
  3. Pearson Correlation - Custom implementation with Abramowitz-Stegun normal CDF approximation (see the sketch below)
  4. Configurable Thresholds:
    • LLM_COUNCIL_LENGTH_CORRELATION_THRESHOLD (default: 0.3)
    • LLM_COUNCIL_POSITION_VARIANCE_THRESHOLD (default: 0.5)
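
A sketch of how the pure-Python correlation described in item 3 might look; the exact statistic used in bias_audit.py may differ (this version assumes a Fisher z-transform feeding the Abramowitz-Stegun normal CDF approximation):

import math
from statistics import mean

A1, A2, A3, A4, A5, P = 0.254829592, -0.284496736, 1.421413741, -1.453152027, 1.061405429, 0.3275911

def _normal_cdf(x: float) -> float:
    """Standard normal CDF via the Abramowitz & Stegun 7.1.26 erf approximation."""
    sign = 1.0 if x >= 0 else -1.0
    z = abs(x) / math.sqrt(2.0)
    t = 1.0 / (1.0 + P * z)
    erf = 1.0 - (((((A5 * t + A4) * t + A3) * t + A2) * t + A1) * t) * math.exp(-z * z)
    return 0.5 * (1.0 + sign * erf)

def pearson_with_p(xs, ys):
    """Pure-Python Pearson r with an approximate two-tailed p-value (Fisher z)."""
    mx, my = mean(xs), mean(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = math.sqrt(sum((a - mx) ** 2 for a in xs) * sum((b - my) ** 2 for b in ys))
    if den == 0.0 or len(xs) < 4:
        return 0.0, 1.0  # insufficient or degenerate data
    r = max(min(num / den, 0.999999), -0.999999)  # clamp before atanh
    z = math.atanh(r) * math.sqrt(len(xs) - 3)
    p = 2.0 * (1.0 - _normal_cdf(abs(z)))
    return num / den, p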

Position Mapping Implementation

Position information is derived from the anonymization label mapping rather than tracked separately.

INVARIANT: Anonymization labels SHALL be assigned in lexicographic order corresponding to presentation order:

  • Label "Response A" → position 0 (first presented)
  • Label "Response B" → position 1 (second presented)
  • And so forth lexicographically

This invariant MUST be maintained by any changes to the anonymization module.

Enhanced Data Structure

To eliminate the fragility of string parsing, the label_to_model structure includes an explicit display_index:

# Enhanced label_to_model structure (v0.3.0+)
{
    "Response A": {"model": "openai/gpt-4", "display_index": 0},
    "Response B": {"model": "anthropic/claude-3", "display_index": 1},
    "Response C": {"model": "google/gemini-pro", "display_index": 2}
}

# derive_position_mapping() output
{"openai/gpt-4": 0, "anthropic/claude-3": 1, "google/gemini-pro": 2}

Backward Compatibility

The derive_position_mapping() function supports both formats:

  • Enhanced format: Uses display_index directly (preferred)
  • Legacy format: Falls back to parsing position from label letter (A=0, B=1)

This enables position bias detection without requiring a separate position tracking system. See ADR-017: Response Order Randomization for the randomization approach that complements this detection.
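
A sketch of the fallback behavior described above (the shipped derive_position_mapping() in bias_audit.py may differ in validation details):

def derive_position_mapping(label_to_model):
    """Map each model to its 0-based presentation position.

    Enhanced format: {"Response A": {"model": "...", "display_index": 0}, ...}
    Legacy format:   {"Response A": "openai/gpt-4", ...} -> position from the label letter.
    """
    mapping = {}
    for label, value in label_to_model.items():
        if isinstance(value, dict):  # enhanced format: trust the explicit index
            mapping[value["model"]] = value["display_index"]
        else:  # legacy format: "Response A" -> 0, "Response B" -> 1, ...
            letter = label.split()[-1].strip().upper()
            mapping[value] = ord(letter) - ord("A")
    return mapping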


Success Metrics

  • Bias audit runs without adding >100ms latency
  • Length-score correlation decreases after ADR-014 implementation
  • Users can identify and respond to systematic biases