
ADR-020: Not Diamond Integration Strategy for LLM Council Ecosystem

Status: Tier 1-3 Implemented (v0.12.0)
Date: 2025-12-19 (Updated: 2025-12-22)
Decision Makers: Engineering, Architecture
Council Review: Completed (GPT-5.2-pro, Claude Opus 4.5, Gemini 3 Pro, Grok-4)
Layer Assignment: Layer 2 - Query Triage & Model Selection (per ADR-024)


Layer Context (ADR-024)

This ADR operates at Layer 2 in the unified routing architecture:

| Layer | ADR | Responsibility |
|-------|-----|----------------|
| L1 | ADR-022 | Tier Selection (quick/balanced/high/reasoning) |
| L2 | ADR-020 | Query Triage & Model Selection |
| L3 | Core | Council Execution (Stage 1-3) |
| L4 | ADR-023 | Gateway Routing |

Interaction Rules:

  • Layer 2 receives TierContract from Layer 1 (models MUST come from TierContract.allowed_pools)
  • Layer 2 can RECOMMEND tier escalation, not force it
  • Escalation requires explicit user notification
  • Layer 2 outputs TriageResult to Layer 3
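
To make the handoff concrete, here is a minimal sketch of the two contract objects. `TierContract` and `TriageResult` are named in the rules above; the specific fields are illustrative assumptions, not a published schema.

```python
from dataclasses import dataclass

@dataclass
class TierContract:
    """Produced by Layer 1 (ADR-022)."""
    tier: str                 # "quick" | "balanced" | "high" | "reasoning"
    allowed_pools: list[str]  # Layer 2 MUST pick models from these pools

@dataclass
class TriageResult:
    """Produced by Layer 2 (this ADR), consumed by Layer 3."""
    selected_models: list[str]    # drawn from TierContract.allowed_pools
    recommended_tier: str | None  # a recommendation only, never forced
    user_notified: bool = False   # escalation requires explicit notification
```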

Implementation Status

| Tier | Component | Status |
|------|-----------|--------|
| Tier 3 | Wildcard Selection | Implemented (v0.8.0) |
| Tier 2 | Prompt Optimization | Implemented (v0.8.0) |
| Tier 1 | Confidence-Gated Fast Path | Implemented (v0.12.0) |
| Tier 1 | Shadow Council Sampling | Implemented (v0.12.0) |
| Tier 1 | Rollback Metric Tracking | Implemented (v0.12.0) |
| Tier 1 | Not Diamond API Integration | Implemented (v0.12.0) - Optional, graceful fallback to heuristics |

Context

Not Diamond (notdiamond.ai) offers AI optimization capabilities that could complement or enhance the LLM Council ecosystem:

Not Diamond Capabilities

| Capability | Description |
|------------|-------------|
| Prompt Adaptation | Automatically optimizes prompts to improve accuracy and adapt to new models |
| Model Routing | Intelligent query routing based on complexity, cost, and latency preferences |
| Custom Routers | Train personalized routing models on evaluation datasets |
| Multi-SDK Support | Python, TypeScript, and REST API integrations |

Current LLM Council Architecture

The LLM Council uses a 3-stage deliberation process:

  1. Stage 1: Parallel queries to 4 fixed models (GPT-5.2-pro, Gemini 3 Pro, Claude Opus 4.5, Grok-4)
  2. Stage 2: Anonymized peer review with Borda count ranking (see the sketch below)
  3. Stage 3: Chairman synthesis of consensus response
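
As a concrete example of the Stage 2 tally, here is a minimal Borda count over anonymized rankings. The point scheme (position i of n earns n - 1 - i points) is the standard formulation, not necessarily the library's exact implementation.

```python
def borda_count(rankings: list[list[str]]) -> dict[str, int]:
    """Each reviewer lists response IDs best-to-worst; position i of n earns n-1-i points."""
    scores: dict[str, int] = {}
    for ranking in rankings:
        n = len(ranking)
        for position, response_id in enumerate(ranking):
            scores[response_id] = scores.get(response_id, 0) + (n - 1 - position)
    return scores

# Three reviewers ranking four anonymized responses:
print(borda_count([["A", "B", "C", "D"],
                   ["B", "A", "D", "C"],
                   ["A", "C", "B", "D"]]))
# -> {'A': 8, 'B': 6, 'C': 3, 'D': 1}
```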

Integration Opportunities

| System | Potential Integration |
|--------|----------------------|
| llm-council (library) | Pre-council routing, prompt optimization, dynamic model selection |
| council-cloud (service) | Tiered routing for cost optimization, A/B testing model combinations |

Decision

Implement a Hybrid Augmentation Strategy where Not Diamond enhances rather than replaces the council's consensus mechanism.

┌─────────────────────────────────────────────────────────────────────────────┐
│ TIER 1: TRIAGE LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ Not Diamond Complexity Classifier │
│ ├── Simple queries → Single best model (bypass council, 70% cost savings) │
│ ├── Medium queries → Lite council (2 models + synthesis) │
│ └── Complex queries → Full council (4 models + peer review) │
└─────────────────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────────────────┐
│ TIER 2: PROMPT OPTIMIZATION │
├─────────────────────────────────────────────────────────────────────────────┤
│ Not Diamond Prompt Adaptation (applied before council stage) │
│ ├── Reduces variance between model responses │
│ ├── Improves consensus quality │
│ └── Adapts queries to model-specific strengths │
└─────────────────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────────────────┐
│ TIER 3: DYNAMIC COUNCIL │
├─────────────────────────────────────────────────────────────────────────────┤
│ 3 Fixed Anchor Models + 1 Wildcard Seat │
│ ├── Anchors: GPT-5.2-pro, Gemini 3 Pro, Claude Opus 4.5 │
│ └── Wildcard: Not Diamond selects specialist (coding, math, creative) │
└─────────────────────────────────────────────────────────────────────────────┘

Integration Points

1. Pre-Council Triage (council-cloud)

```python
# Pseudocode for tiered routing
async def route_query(query: str) -> CouncilResponse:
    complexity = await not_diamond.classify_complexity(query)

    if complexity == "simple":
        # Cheapest capable single model; bypasses the council entirely
        model = await not_diamond.route(query, optimize="cost")
        return await query_single_model(model, query)

    elif complexity == "medium":
        # Lite council: 2 models plus synthesis
        return await run_lite_council(query, models=2)

    else:  # complex
        return await run_full_council(query)
```

Expected Savings: 60-80% cost reduction on simple queries (estimated 40% of traffic)

2. Prompt Optimization (llm-council)

```python
# Apply before Stage 1
async def stage1_with_optimization(query: str) -> List[Response]:
    optimized = await not_diamond.adapt_prompt(
        prompt=query,
        target_models=COUNCIL_MODELS,
        metric="semantic_similarity",
    )
    return await stage1_collect_responses(optimized)
```

Expected Benefit: Reduced response variance, improved consensus quality

3. Dynamic Wildcard Selection (llm-council)

```python
# Configuration option for dynamic 4th seat
COUNCIL_MODELS = [
    "openai/gpt-5.2-pro",         # Anchor
    "google/gemini-3-pro",        # Anchor
    "anthropic/claude-opus-4.5",  # Anchor
    "dynamic:not-diamond",        # Wildcard - selected per query
]
```

Expected Benefit: Specialized expertise without manual model selection


API Integration Details

Not Diamond Endpoints

| Endpoint | Purpose | Integration Point |
|----------|---------|-------------------|
| POST /v2/prompt/adapt | Prompt optimization | Pre-Stage 1 |
| POST /v2/modelRouter/modelSelect | Model routing | Triage layer |
| POST /v2/pzn/trainCustomRouter | Custom router training | council-cloud analytics |

Authentication

```bash
export NOT_DIAMOND_API_KEY="your-key"
```

SDK Installation

```bash
pip install notdiamond  # Python SDK
```
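
Since this ADR documents only the REST endpoints, here is a hedged raw-HTTP sketch of the prompt-adaptation call. The path comes from the endpoint table above; the base URL, payload fields, and response shape are assumptions, not documented Not Diamond schema.

```python
import os
import httpx

async def adapt_prompt(prompt: str, target_models: list[str]) -> dict:
    # Base URL and JSON fields are illustrative assumptions.
    async with httpx.AsyncClient(base_url="https://api.notdiamond.ai") as client:
        resp = await client.post(
            "/v2/prompt/adapt",
            headers={"Authorization": f"Bearer {os.environ['NOT_DIAMOND_API_KEY']}"},
            json={"prompt": prompt, "target_models": target_models},
        )
        resp.raise_for_status()
        return resp.json()
```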

Alternatives Considered

Alternative 1: Full Replacement (Use Not Diamond Instead of Council)

Rejected: Loses the peer review consensus mechanism that provides:

  • Hallucination reduction through cross-validation
  • Nuanced disagreement detection
  • Transparency via inspectable rankings

Alternative 2: No Integration (Status Quo)

Rejected: Misses cost optimization opportunities. Current fixed council costs ~$0.08-0.15 per query regardless of complexity.

Alternative 3: Post-Council Routing Only

Rejected: Doesn't address the primary cost driver (running all 4 models on every query).


Implementation Phases

Phase 1: Evaluation (2 weeks)

  • A/B test Not Diamond routing accuracy vs. council consensus
  • Measure latency overhead of routing layer
  • Calculate cost savings on production traffic sample

Phase 2: Triage Integration (3 weeks)

  • Implement complexity classifier wrapper
  • Add LLM_COUNCIL_ROUTING_MODE config option
  • Create lite council mode (2 models)

Phase 3: Prompt Optimization (2 weeks)

  • Integrate prompt adaptation API
  • Benchmark variance reduction
  • Add LLM_COUNCIL_PROMPT_OPTIMIZATION config option

Phase 4: Dynamic Wildcard (3 weeks)

  • Implement wildcard model selection
  • Train custom router on council evaluation data
  • A/B test wildcard vs. fixed 4th seat

Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| Routing accuracy < council accuracy | Medium | High | A/B test with fallback to full council |
| Added latency from routing layer | Medium | Medium | Cache routing decisions for similar queries |
| Vendor lock-in to Not Diamond | Low | Medium | Abstract behind interface, maintain bypass option |
| Prompt optimization changes intent | Low | High | Validate optimized prompts preserve semantics |

Success Metrics

| Metric | Target | Measurement |
|--------|--------|-------------|
| Cost per query (avg) | -50% | Compare pre/post monthly costs |
| Simple query latency | -60% | P50 latency for routed queries |
| Council consensus quality | No degradation | Measure disagreement rate |
| Routing accuracy | >95% | Compare routed model vs. council winner |

Configuration Options

```bash
# Triage routing mode
LLM_COUNCIL_ROUTING_MODE=auto|full|lite|bypass   # default: full

# Prompt optimization
LLM_COUNCIL_PROMPT_OPTIMIZATION=true|false       # default: false

# Dynamic wildcard
LLM_COUNCIL_WILDCARD_MODEL=dynamic|<model-id>    # default: x-ai/grok-4

# Not Diamond API key (required for routing/optimization)
NOT_DIAMOND_API_KEY=<key>
```
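
A minimal sketch of how llm-council might read these settings; the variable names come from the block above, while the parsing and defaults shown are assumptions.

```python
import os

ROUTING_MODE = os.getenv("LLM_COUNCIL_ROUTING_MODE", "full")  # auto|full|lite|bypass
PROMPT_OPT = os.getenv("LLM_COUNCIL_PROMPT_OPTIMIZATION", "false").lower() == "true"
WILDCARD_MODEL = os.getenv("LLM_COUNCIL_WILDCARD_MODEL", "x-ai/grok-4")  # or "dynamic"
ND_API_KEY = os.getenv("NOT_DIAMOND_API_KEY")  # absent -> fall back to local heuristics
```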

Council Review Summary

Status: CONDITIONAL ACCEPTANCE (Requires architectural changes to Tier 1 and Tier 2)

Reviewed by: GPT-5.2-pro (109s), Claude Opus 4.5 (64s), Gemini 3 Pro (31s), Grok-4 (72s)

Council Verdict: All 4 models responded. Unanimous agreement on:

  • Tier 3 (Wildcard): APPROVED - Strongest value proposition
  • Tier 2 (Prompt Optimization): APPROVED WITH MODIFICATIONS
  • Tier 1 (Triage): REDESIGN REQUIRED - Primary risk to consensus integrity

Consensus Answers to Key Questions

1. Does triage routing compromise consensus quality?

Yes, significantly—if implemented as a simple filter.

All respondents agree that pre-classifying a query as "simple" is error-prone. The council unanimously recommends:

  • Confidence-Gated Routing: Don't route based on query complexity; route based on model's self-reported confidence AFTER attempting an answer
  • Shadow Council: Random 5% sampling of triaged queries through full council to measure "regret rate"
  • Escalation Threshold: Only bypass council if confidence > 0.92 AND complexity_score < 0.3

```yaml
tier_1_redesign:
  name: "Confidence-Gated Fast Path"
  flow:
    1. Route to single model via Not Diamond
    2. Model responds WITH calibrated confidence score
    3. Gate decision:
       - confidence >= 0.92 AND low_risk: Return single response
       - else: Escalate to full council
  audit: 5% shadow council sampling
  cost_savings: ~45-55% (lower but safer than 70%)
```

2. Is prompt optimization valuable for multi-model scenarios?

Yes, but only as "Translation," not "Rewriting."

Council consensus: Apply Per-Model Adaptation while maintaining semantic equivalence.

  • Do: Format prompts for model preferences (Claude's XML vs GPT's Markdown)
  • Don't: Globally optimize prompts that may favor one model's style
  • Constraint: Verify semantic equivalence (cosine similarity > 0.93) across adapted prompts

```python
# Council-recommended approach
class CouncilPromptOptimizer:
    def optimize(self, prompt: str, models: list) -> dict:
        canonical = self.extract_intent(prompt)  # Immutable core
        adapted = {m: self.adapt_syntax(canonical, m) for m in models}
        if not self.verify_equivalence(adapted.values()):
            return {m: prompt for m in models}  # Fallback to original
        return adapted
```
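
One way the `verify_equivalence` check above could be implemented, given the council's cosine-similarity constraint (> 0.93); `embed()` is an assumed embedding helper, not part of the codebase.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def verify_equivalence(prompts: list[str], threshold: float = 0.93) -> bool:
    # Compare every adapted prompt against the first; all must clear the bar.
    vectors = [embed(p) for p in prompts]  # embed() is an assumed helper
    anchor = vectors[0]
    return all(cosine(anchor, v) > threshold for v in vectors[1:])
```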

3. Should the wildcard seat be domain-specialized or general?

Domain-Specialized (unanimous)

Adding another generalist yields diminishing returns. Council recommends:

  • Specialist Pool: DeepSeek (code), Med-PaLM (health), o1-preview (reasoning)
  • Diversity Constraint: Wildcard must differ from base council on model family, training methodology, or architecture
  • Fallback: If specialist unavailable, use quantized generalist (Llama 3)

4. What risks are underestimated?

| Risk | Severity | Mitigation |
|------|----------|------------|
| Latency Stacking | High | Pre-warm specialists, cache optimizations |
| Router Bias/Drift | High | Shadow evaluation, drift monitoring |
| Correlated Failures | Medium | Enforce diversity constraints on wildcard |
| Aggregation Breakage | Medium | Require consistent JSON schema from all seats |
| Security (Prompt Injection) | Medium | Harden routing so adversarial instructions embedded in queries cannot steer model selection |
| Reproducibility Gaps | Medium | Log task_spec_id, routing decision, model versions |

Architectural Recommendations from Council

1. Redesign Tier 1: "Confidence Circuit" (not complexity filter)

Request → [Single Model] → Check Confidence
If Confidence > 90% AND Safety_Flag == False → Return Response
Else → Forward to [Full Council]

Audit: 5% shadow sampling to measure regret rate
Rollback trigger: shadow_council_disagreement_rate > 8%
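
A compact sketch of this circuit in code; the thresholds mirror the flow above, and the helper names (`answer_with_single_model`, `record_shadow_regret`) are assumptions.

```python
import asyncio
import random

SHADOW_SAMPLE_RATE = 0.05  # 5% of fast-path traffic also runs the full council

async def confidence_circuit(query: str) -> CouncilResponse:
    draft = await answer_with_single_model(query)  # routed via Not Diamond

    # Shadow council: audit a random sample to measure the regret rate.
    if random.random() < SHADOW_SAMPLE_RATE:
        asyncio.create_task(record_shadow_regret(query, draft))

    if draft.confidence > 0.90 and not draft.safety_flag:
        return draft                      # fast path
    return await run_full_council(query)  # escalate
```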

2. Refine Tier 2: "Adapter Pattern" (not rewriting)

  • Create Canonical Task Spec (immutable intent)
  • Apply syntactic adapters per model
  • Verify semantic equivalence before sending
  • Monitor for consensus rate changes (either direction is suspicious)

3. Harden Tier 3: "Specialist Pool" with constraints

```yaml
wildcard_configuration:
  pool:
    - code: deepseek-v3, codestral
    - reasoning: o1-preview, deepseek-r1
    - creative: claude-opus, command-r
    - multilingual: gpt-4, command-r

  constraints:
    - must_differ_from_base_council: [family, training, architecture]
    - timeout_fallback: llama-3-70b
    - max_selection_latency: 200ms
```
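
A hedged sketch of resolving the `dynamic:not-diamond` seat against this pool; `select_specialist` and `differs_from_base_council` are assumed helpers, not a shipped API, and the 200ms budget comes from `max_selection_latency` above.

```python
import asyncio

async def resolve_wildcard(query: str, base_council: list[str]) -> str:
    try:
        candidate = await asyncio.wait_for(
            not_diamond.select_specialist(query),  # assumed wrapper
            timeout=0.2,                           # max_selection_latency: 200ms
        )
        if differs_from_base_council(candidate, base_council):
            return candidate  # family/training/architecture diversity holds
    except asyncio.TimeoutError:
        pass
    return "llama-3-70b"  # timeout_fallback
```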

4. Council Orchestrator Architecture

1. Ingress → Authenticate, rate-limit, attach tenant policy
2. Normalize → Canonical Task Spec (log task_spec_id)
3. Safety Gate → Classify risk tier, enforce allowed models
4. Triage Decision → Not Diamond + local heuristics
5. Execute → Single model OR council (3 fixed + 1 wildcard)
6. Aggregate → Structured decision with dissent summary
7. Post-process → Schema validation, emit metrics, store audit bundle
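
The seven steps could compose into a single pipeline like the skeleton below; every helper name is an illustrative assumption.

```python
async def handle_request(raw_request: "Request") -> "Decision":
    ctx = await authenticate_and_rate_limit(raw_request)  # 1. Ingress
    spec = normalize_to_task_spec(ctx)                    # 2. Normalize (logs task_spec_id)
    policy = classify_risk_tier(spec)                     # 3. Safety Gate
    plan = await triage(spec, policy)                     # 4. Not Diamond + local heuristics
    responses = await execute(plan, spec)                 # 5. Single model OR 3 fixed + 1 wildcard
    decision = aggregate_with_dissent_summary(responses)  # 6. Aggregate
    await validate_and_audit(spec, plan, decision)        # 7. Post-process
    return decision
```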

Implementation Revision (Council-Informed)

| Phase | Original | Council Revision |
|-------|----------|------------------|
| Phase 1 | Evaluation (2 weeks) | Tier 3 only - Lowest risk, highest value |
| Phase 2 | Triage Integration | Add Tier 2 - Prompt optimization with adapters |
| Phase 3 | Prompt Optimization | Tier 1 v2 - Confidence-gated routing with heavy monitoring |
| Phase 4 | Dynamic Wildcard | Continuous optimization bounded by SLOs |

Rollback Triggers (Council-Defined)

```yaml
automatic_rollback:
  tier_1:
    - shadow_council_disagreement_rate > 8%
    - user_escalation_rate > 15%
    - error_report_rate > baseline * 1.5
  tier_2:
    - consensus_rate_change > 10%
    - prompt_divergence > 0.2
  tier_3:
    - wildcard_timeout_rate > 5%
    - wildcard_disagreement_rate < 20%  # Not adding value if always agrees
```
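
A minimal evaluator for the Tier 1 triggers above; the metric names mirror the YAML, and the metrics dictionary is an assumed monitoring interface.

```python
def should_rollback_tier_1(metrics: dict[str, float], baseline_error_rate: float) -> bool:
    return (
        metrics["shadow_council_disagreement_rate"] > 0.08
        or metrics["user_escalation_rate"] > 0.15
        or metrics["error_report_rate"] > baseline_error_rate * 1.5
    )
```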
