ADR-021: Quint Code and First Principles Framework (FPF) Integration

Status: Proposed
Date: 2025-12-19
Decision Makers: Engineering, Architecture
Council Review: Pending (GPT-5.2-pro, Claude Opus 4.5, Gemini 3 Pro, Grok-4)


Context

Two complementary frameworks have emerged for structured AI-assisted reasoning:

Quint Code (github.com/m0n0x41d/quint-code)

A structured reasoning framework for AI-assisted development that creates auditable decision trails. Implements the First Principles Framework (FPF) methodology through:

| Component | Description |
|---|---|
| Abduction Phase | Generate 3-5 competing hypotheses (stored in L0/) |
| Deduction Phase | Verify logical consistency, promote to L1/ |
| Induction Phase | Gather empirical evidence, promote to L2/ |
| Trust Scoring | Weakest-link (WLNK) assurance model |
| Bias Detection | Flags anchoring bias and early-hypothesis privilege |
| Design Rationale Records | Auditable decision artifacts with expiry conditions |

Integration: Works via the Model Context Protocol (MCP) with Claude Code, Cursor, Gemini CLI, and Codex CLI.

First Principles Framework (FPF) (github.com/ailev/FPF)

A transdisciplinary "Operating System for Thought" providing:

| Component | Description |
|---|---|
| Holonic Foundation | Everything as whole and part simultaneously |
| Trust Formula | Trust = ⟨F, G, R⟩ (Formality, Granularity, Reliability) |
| Γ-Algebra | Universal aggregation preserving invariants |
| Bounded Contexts | Terms hold meaning only within defined boundaries |
| LLM Integration | Functions as "bias-assistant" steering toward first-principles |

Functional Alignment Analysis

Conceptual Overlap with LLM Council

| Dimension | LLM Council | Quint Code/FPF | Alignment |
|---|---|---|---|
| Multi-perspective | 4 models provide diverse viewpoints | Multiple competing hypotheses | HIGH |
| Quality Assurance | Peer review + Borda count ranking | Trust scoring + WLNK model | HIGH |
| Bias Detection | ADR-015 bias auditing | Anchoring bias detection | HIGH |
| Decision Artifacts | Aggregate rankings + synthesis | Design Rationale Records | MEDIUM |
| Temporal Validity | Per-session (ephemeral) | Evidence decay tracking | LOW |
| Knowledge Levels | Flat (all responses equal) | Hierarchical (L0→L1→L2) | LOW |

Key Differences

| Aspect | LLM Council | Quint Code/FPF |
|---|---|---|
| Execution Mode | Runtime (query-time) | Development-time (persistent) |
| Focus | Answer synthesis | Decision documentation |
| Verification | Peer agreement | Logical + empirical proof |
| Storage | Ephemeral (per-session) | Persistent knowledge base |
| Trust Model | Vote aggregation | Weakest-link chain |

Decision

Implement a bidirectional integration in which LLM Council enhances Quint Code's hypothesis generation and Quint Code's trust model enhances the council's decision confidence.

Proposed Integration Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                   INTEGRATION LAYER: "Principled Council"                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌──────────────────────┐          ┌──────────────────────┐                │
│   │  QUINT CODE          │          │  LLM COUNCIL         │                │
│   │  (Structured         │◄────────►│  (Multi-Model        │                │
│   │   Reasoning)         │          │   Consensus)         │                │
│   └──────────────────────┘          └──────────────────────┘                │
│              │                                 │                            │
│              ▼                                 ▼                            │
│   ┌──────────────────────┐          ┌──────────────────────┐                │
│   │  Abduction Phase     │          │  Stage 1: Collection │                │
│   │  - Use council for   │◄─────────│  - Multiple models   │                │
│   │    hypothesis gen    │          │    generate options  │                │
│   └──────────────────────┘          └──────────────────────┘                │
│              │                                 │                            │
│              ▼                                 ▼                            │
│   ┌──────────────────────┐          ┌──────────────────────┐                │
│   │  Deduction Phase     │          │  Stage 2: Peer Review│                │
│   │  - Verify logic via  │◄─────────│  - Cross-validate    │                │
│   │    council critique  │          │    reasoning         │                │
│   └──────────────────────┘          └──────────────────────┘                │
│              │                                 │                            │
│              ▼                                 ▼                            │
│   ┌──────────────────────┐          ┌──────────────────────┐                │
│   │  Trust Scoring       │─────────►│  Confidence Weights  │                │
│   │  - WLNK model        │          │  - Apply to rankings │                │
│   │  - Evidence chain    │          │  - Qualify synthesis │                │
│   └──────────────────────┘          └──────────────────────┘                │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Integration Points

1. Council-Powered Hypothesis Generation

Replace Quint Code's single-model abduction with council-based generation:

# Current: Single model generates hypotheses
hypotheses = await model.generate_hypotheses(problem)

# Proposed: Council generates diverse hypotheses
async def council_abduction(problem: str) -> List[Hypothesis]:
    """Use LLM Council for hypothesis generation phase."""
    result = await run_council_with_fallback(
        f"Generate 3-5 competing hypotheses for: {problem}. "
        "Each hypothesis should represent a distinct approach."
    )

    # Extract hypotheses from each model's response
    hypotheses = []
    for response in result["stage1_responses"]:
        hypotheses.extend(parse_hypotheses(response))

    # Deduplicate and return with provenance
    return deduplicate_with_provenance(hypotheses)

Benefits:

  • Model diversity reduces the risk of anchoring bias
  • Each hypothesis comes with model provenance
  • Natural competition between approaches
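
The `deduplicate_with_provenance` helper is referenced above but not defined. One way it might work is sketched below, assuming a hypothetical minimal `Hypothesis` shape (`content` plus a list of proposing `models`) and treating two hypotheses as duplicates when their text matches after case and whitespace normalization. A real implementation would more likely compare embeddings.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Hypothesis:
    """Hypothetical minimal shape; the real Quint Code type may differ."""
    content: str
    models: List[str] = field(default_factory=list)


def _normalize(text: str) -> str:
    # Case- and whitespace-insensitive key (an assumption for this sketch).
    return " ".join(text.lower().split())


def deduplicate_with_provenance(hypotheses: List[Hypothesis]) -> List[Hypothesis]:
    """Merge duplicates, keeping every model that proposed each hypothesis."""
    merged: Dict[str, Hypothesis] = {}
    for h in hypotheses:
        key = _normalize(h.content)
        if key not in merged:
            merged[key] = Hypothesis(content=h.content, models=[])
        for m in h.models:
            if m not in merged[key].models:
                merged[key].models.append(m)
    return list(merged.values())


raw = [
    Hypothesis("Use a write-through cache", ["model-a"]),
    Hypothesis("use a  write-through cache", ["model-b"]),
    Hypothesis("Shard the database", ["model-c"]),
]
unique = deduplicate_with_provenance(raw)
```

Here the two cache hypotheses collapse into one entry that records both proposing models, which is the provenance the benefits list relies on.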

2. Council-Assisted Deduction Verification

Use peer review for logical verification:

async def council_verify(hypothesis: Hypothesis) -> VerificationResult:
    """Use council peer review for logical verification."""
    result = await run_council_with_fallback(
        f"Verify logical consistency of this hypothesis:\n"
        f"{hypothesis.content}\n\n"
        "Check for: constraint violations, type errors, "
        "implicit assumptions, edge cases."
    )

    # Unanimous agreement promotes to L1 outright
    if result["consensus_type"] == "unanimous":
        return VerificationResult(passed=True, level="L1")
    # Majority agreement also promotes to L1, but records the dissent as caveats
    elif result["consensus_type"] == "majority":
        return VerificationResult(passed=True, level="L1",
                                  caveats=result["dissent_summary"])
    # Anything else fails verification
    else:
        return VerificationResult(passed=False,
                                  reasons=result["disagreements"])
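
The verification above branches on `result["consensus_type"]`, but the ADR does not say how that value is derived. A minimal sketch, assuming each model returns a boolean pass/fail verdict and that `classify_consensus` is a hypothetical helper name:

```python
from typing import List


def classify_consensus(verdicts: List[bool]) -> str:
    """Return 'unanimous', 'majority', or 'split' for per-model pass votes.

    'unanimous' and 'majority' refer to votes in favour; a council that
    unanimously rejects a hypothesis is reported as 'split' so that the
    caller treats it as a failed verification.
    """
    if not verdicts:
        return "split"  # no votes: treat as no consensus
    passes = sum(verdicts)
    if passes == len(verdicts):
        return "unanimous"
    if passes * 2 > len(verdicts):
        return "majority"
    return "split"
```

With a four-model council, three passing votes yield `"majority"`, while a 2-2 tie yields `"split"`.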

3. Trust-Weighted Council Rankings

Apply FPF's trust formula to council rankings:

@dataclass
class TrustWeightedRanking:
    model: str
    raw_score: float
    trust_weight: float  # From FPF ⟨F, G, R⟩
    weighted_score: float

def apply_trust_weights(
    rankings: List[Ranking],
    evidence_chain: EvidenceChain
) -> List[TrustWeightedRanking]:
    """Apply weakest-link trust model to council rankings."""
    weighted = []

    for ranking in rankings:
        # F = Formality (how rigorous was the evaluation)
        formality = calculate_formality(ranking.evaluation_text)

        # G = Granularity (scope of claims made)
        granularity = calculate_granularity(ranking.claims)

        # R = Reliability (evidence backing)
        reliability = evidence_chain.weakest_link_score()

        # Trust = min(F, G, R) per WLNK model
        trust = min(formality, granularity, reliability)

        # Build a new record instead of mutating the input Ranking,
        # so the return value matches the declared type
        weighted.append(TrustWeightedRanking(
            model=ranking.model,  # assumes Ranking exposes the model name
            raw_score=ranking.raw_score,
            trust_weight=trust,
            weighted_score=ranking.raw_score * trust,
        ))

    return sorted(weighted, key=lambda r: r.weighted_score, reverse=True)
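
A small worked example may help show why trust weighting can reorder rankings. The numbers below are illustrative only, not measurements: the model with the higher raw score drops to second place because its weakest component (R = 0.4) caps its trust.

```python
# (model, raw_score, F, G, R); all values are illustrative assumptions
candidates = [
    ("model-a", 0.90, 0.8, 0.9, 0.4),
    ("model-b", 0.75, 0.8, 0.7, 0.9),
]

# weighted_score = raw_score * min(F, G, R), as in apply_trust_weights
weighted = sorted(
    ((name, round(raw * min(f, g, r), 3)) for name, raw, f, g, r in candidates),
    key=lambda pair: pair[1],
    reverse=True,
)
```

`model-a` scores 0.90 × 0.4 = 0.36 and `model-b` scores 0.75 × 0.7 = 0.525, so `model-b` ranks first after weighting.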

4. Design Rationale Records for Council Decisions

Generate DRRs from council consensus:

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List
from uuid import uuid4

@dataclass
class DesignRationaleRecord:
    decision_id: str
    timestamp: datetime
    question: str
    winning_hypothesis: str
    alternatives_considered: List[str]
    evidence_chain: List[Evidence]
    council_rankings: List[Ranking]
    consensus_type: str  # unanimous, majority, split
    trust_score: float
    valid_until: datetime  # Evidence expiry

def generate_drr(council_result: CouncilResult) -> DesignRationaleRecord:
    """Convert council result to Design Rationale Record."""
    return DesignRationaleRecord(
        decision_id=f"DRR-{uuid4()}",
        # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12
        timestamp=datetime.now(timezone.utc),
        question=council_result["query"],
        winning_hypothesis=council_result["synthesis"]["response"],
        alternatives_considered=[
            r["response"] for r in council_result["stage1_responses"]
        ],
        evidence_chain=extract_evidence(council_result),
        council_rankings=council_result["aggregate_rankings"],
        consensus_type=determine_consensus_type(council_result),
        trust_score=calculate_trust(council_result),
        valid_until=calculate_expiry(council_result),
    )
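
The ADR leaves `calculate_expiry` undefined. One plausible reading, in keeping with the weakest-link model, is that a decision stays valid only as long as its shortest-lived piece of evidence. The sketch below assumes each evidence item carries its own expiry timestamp; `earliest_expiry` and `is_stale` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def earliest_expiry(evidence_expiries: List[datetime]) -> Optional[datetime]:
    """A decision is only as fresh as its shortest-lived evidence."""
    return min(evidence_expiries) if evidence_expiries else None


def is_stale(valid_until: Optional[datetime], now: datetime) -> bool:
    # With no evidence there is nothing to keep the decision valid.
    return valid_until is None or now >= valid_until


now = datetime(2025, 12, 19, tzinfo=timezone.utc)
expiries = [now + timedelta(days=90), now + timedelta(days=30)]
valid_until = earliest_expiry(expiries)
```

A decay check could then call `is_stale(record.valid_until, datetime.now(timezone.utc))` when a DRR is read back.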

Knowledge Level Mapping

Map Quint Code's L0→L1→L2 to council consensus levels:

| Quint Level | Description | Council Equivalent |
|---|---|---|
| L0 (Raw) | Unverified hypothesis | Single model response |
| L1 (Verified) | Logically consistent | Majority consensus |
| L2 (Validated) | Empirically proven | Unanimous + external validation |

def council_result_to_knowledge_level(result: CouncilResult) -> str:
    """Map council consensus to FPF knowledge level."""
    rankings = result["aggregate_rankings"]
    top_score = rankings[0]["score"] if rankings else 0

    if result["consensus_type"] == "unanimous" and top_score > 0.9:
        # High confidence. The external validation that L2 also requires
        # is not checked here and must be confirmed separately.
        return "L2"
    elif result["consensus_type"] in ("unanimous", "majority"):
        return "L1"  # Logically verified
    else:
        return "L0"  # Raw hypothesis

Alternatives Considered

Alternative 1: Replace Council with Quint Code Entirely

Rejected: Quint Code is development-time focused; LLM Council is runtime-focused. They serve complementary purposes.

Alternative 2: No Integration (Use Separately)

Rejected: Significant synergy opportunities missed. Both systems address quality and bias but from different angles.

Alternative 3: Quint Code as Council Pre-processor Only

Rejected: Loses the value of FPF's trust model for enhancing council confidence scoring.


Implementation Phases

Phase 1: Evaluation (2 weeks)

  • Benchmark council-based hypothesis generation vs. single-model
  • Measure diversity improvement in abduction phase
  • Test trust-weighted ranking quality

Phase 2: Council-Powered Abduction (3 weeks)

  • Implement /q1-hypothesize-council command
  • Add provenance tracking for council-generated hypotheses
  • Update Quint Code's L0 storage format

Phase 3: Trust-Weighted Rankings (2 weeks)

  • Implement WLNK trust calculator for council
  • Add trust scores to council metadata
  • Create LLM_COUNCIL_TRUST_MODEL=wlnk config option

Phase 4: DRR Generation (2 weeks)

  • Implement Design Rationale Record generator
  • Add DRR storage to .quint/decisions/
  • Create decay detection for council-based decisions

Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Latency increase from council in abduction | High | Medium | Cache similar hypotheses, async generation |
| Trust model complexity | Medium | Medium | Start with simplified F-G-R calculation |
| Dual system maintenance burden | Medium | High | Clear interface boundaries, optional integration |
| Knowledge level mapping mismatch | Low | Medium | Conservative defaults, explicit override option |

Success Metrics

| Metric | Target | Measurement |
|---|---|---|
| Hypothesis diversity | +40% unique approaches | Compare single-model vs. council |
| Anchoring bias reduction | -60% first-hypothesis wins | Track which hypothesis wins |
| Decision confidence | +25% trust scores | Before/after trust model |
| DRR completeness | 100% decisions documented | Audit trail coverage |
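
The anchoring-bias metric can be computed from a simple log of which hypothesis won each decision. The sketch below assumes winners are recorded by their 0-based generation position; the function names and the sample data are illustrative, not results.

```python
from typing import List


def first_hypothesis_win_rate(winning_positions: List[int]) -> float:
    """Share of decisions won by the hypothesis generated first (position 0)."""
    if not winning_positions:
        raise ValueError("need at least one decision")
    return sum(1 for p in winning_positions if p == 0) / len(winning_positions)


def relative_change(baseline: float, candidate: float) -> float:
    """Signed relative change; -0.6 corresponds to the -60% target."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline


# Illustrative data only: position of the winning hypothesis per decision.
single_model = [0, 0, 0, 1, 0, 2, 0, 0, 1, 0]
council = [1, 2, 0, 3, 1, 2, 0, 1, 2, 3]

baseline_rate = first_hypothesis_win_rate(single_model)
council_rate = first_hypothesis_win_rate(council)
change = relative_change(baseline_rate, council_rate)
```

In this made-up sample the first-hypothesis win rate falls from 0.7 to 0.2, a relative change of roughly -71%, which would meet the -60% target.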

Configuration Options

# Enable council-based hypothesis generation
LLM_COUNCIL_QUINT_INTEGRATION=true

# Trust model for rankings: wlnk, simple, or none (default: simple)
LLM_COUNCIL_TRUST_MODEL=simple

# Generate Design Rationale Records
LLM_COUNCIL_GENERATE_DRR=true

# DRR storage location
LLM_COUNCIL_DRR_PATH=.quint/decisions/
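
A minimal sketch of how these options might be read and validated at startup. Only the `simple` default for the trust model comes from this ADR; the other defaults (integration and DRR generation off, `.quint/decisions/` as the path), the accepted boolean spellings, and the `QuintIntegrationConfig` and `load_config` names are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping

VALID_TRUST_MODELS = ("wlnk", "simple", "none")


@dataclass(frozen=True)
class QuintIntegrationConfig:
    enabled: bool
    trust_model: str
    generate_drr: bool
    drr_path: str


def _as_bool(value: str) -> bool:
    # Accepted truthy spellings are an assumption of this sketch.
    return value.strip().lower() in ("1", "true", "yes", "on")


def load_config(env: Mapping[str, str] = os.environ) -> QuintIntegrationConfig:
    """Read the LLM_COUNCIL_* options, rejecting unknown trust models."""
    trust_model = env.get("LLM_COUNCIL_TRUST_MODEL", "simple").strip().lower()
    if trust_model not in VALID_TRUST_MODELS:
        raise ValueError(
            f"LLM_COUNCIL_TRUST_MODEL must be one of {VALID_TRUST_MODELS}, "
            f"got {trust_model!r}"
        )
    return QuintIntegrationConfig(
        enabled=_as_bool(env.get("LLM_COUNCIL_QUINT_INTEGRATION", "false")),
        trust_model=trust_model,
        generate_drr=_as_bool(env.get("LLM_COUNCIL_GENERATE_DRR", "false")),
        drr_path=env.get("LLM_COUNCIL_DRR_PATH", ".quint/decisions/"),
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test.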

Council Review Summary

Status: APPROVE WITH MODIFICATIONS

Reviewed by: Gemini 3 Pro (34s), Claude Opus 4.5 (44s), Grok-4 (70s)
GPT-5.2-pro: timeout (120s)

Council Verdict: Unanimous approval among the three responding models, with significant architectural modifications. The integration is "architecturally sound but overengineered in its current form."


Consensus Analysis

1. Does Council-Based Abduction Improve Hypothesis Diversity?

Verdict: Conditionally Yes

Council does NOT automatically guarantee diversity—models often collapse into safe consensus based on shared training data.

Required Modifications:

  • Role-Based Prompting: Assign specific roles (Scientist, Historian, Logician) to different models
  • Adversarial Seeding: Require at least one model to argue against emerging consensus
  • Model Heterogeneity: Mix model families (GPT + Claude + Llama) rather than same-family instances
  • Diversity Metrics: Measure semantic distance between hypotheses, not just count

# Council-recommended diversity enforcement
class DiversityEnforcedCouncil:
    def generate_hypotheses(self, problem: str) -> List[Hypothesis]:
        # Assign adversarial roles
        roles = ["primary_proposer", "devil_advocate", "synthesis_agent"]

        hypotheses = self.collect_with_roles(problem, roles)

        # Enforce minimum diversity via a semantic-variance threshold
        if semantic_variance(hypotheses) < DIVERSITY_THRESHOLD:
            return self.force_divergence(hypotheses)

        return hypotheses
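
The `semantic_variance` function is not defined in this ADR. As a rough stand-in, the sketch below uses the mean pairwise Jaccard distance between token sets, which needs no embedding model. This is an assumption for illustration, and it operates on plain strings rather than `Hypothesis` objects. A production version would more likely use embedding distances.

```python
from itertools import combinations
from typing import List


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over lowercase whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def semantic_variance(texts: List[str]) -> float:
    """Mean pairwise distance; 0.0 means identical, 1.0 means no shared tokens."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)
```

For example, `semantic_variance(["add a cache", "add a cache layer"])` is 0.25, which would fall below a 0.3 diversity threshold and trigger forced divergence.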

2. Is FPF Trust Model (F-G-R) Applicable to Multi-Model Consensus?

Verdict: Partially—Requires Reinterpretation

| Component | Single-Model Meaning | Council Reinterpretation |
|---|---|---|
| Fidelity (F) | Accuracy to source | Inter-model agreement on factual claims |
| Groundedness (G) | Traceability to evidence | Convergent citation of same sources |
| Robustness (R) | Stability under perturbation | Consistency across prompt variations |

Critical Issue: WLNK is Problematic

The pure Weakest-Link model WLNK = min(F, G, R) becomes excessively conservative in council contexts:

WLNK_council = min(min(F_i), min(G_i), min(R_i)) for all models i

This "double-minimum" means a single model's low score tanks the entire output.
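
The effect is easy to see with numbers. In the illustrative scores below (assumed values, not measurements), two of three models are solid, yet the council-level score equals the single weakest component of the weakest model.

```python
scores = {
    "model-a": {"F": 0.9, "G": 0.8, "R": 0.9},
    "model-b": {"F": 0.8, "G": 0.9, "R": 0.8},
    "model-c": {"F": 0.9, "G": 0.9, "R": 0.3},  # one weak component
}

# Per-model weakest link: min(F, G, R)
per_model_wlnk = {m: min(s.values()) for m, s in scores.items()}

# Council-level double minimum: min over components of min over models
wlnk_council = min(
    min(s[k] for s in scores.values()) for k in ("F", "G", "R")
)
```

Here `per_model_wlnk` is 0.8, 0.8, and 0.3, and `wlnk_council` is 0.3, so the two stronger models contribute nothing to the final score.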

Council-Recommended Alternative: Robust Aggregate Trust (RAT)

from math import prod
from statistics import median
from typing import List

def calculate_rat(wlnk_scores: List[float], disagreement: float) -> float:
    """Replace pure WLNK with Robust Aggregate Trust."""
    α, β, γ = 0.5, 0.3, 0.2  # Weights

    geometric_mean = prod(wlnk_scores) ** (1 / len(wlnk_scores))
    median_score = median(wlnk_scores)
    min_score = min(wlnk_scores)

    # Up to +20% when models fully agree, so the result can exceed 1.0
    coherence_bonus = 1 + 0.2 * (1 - disagreement)

    return (α * geometric_mean + β * median_score + γ * min_score) * coherence_bonus

Tiered Trust Application:

  • L0 (Facts): Use strict WLNK—any factual error breaks the chain
  • L1 (Inferences): Use weighted aggregation—allow outvoting of weak links
  • L2 (Hypotheses): Use RAT—preserve diversity, don't force premature consensus
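
The tiered application can be sketched as a dispatch on knowledge level. The RAT formula and its weights follow the council's recommendation above. Using the arithmetic mean for the L1 "weighted aggregation" is an assumption, since the review does not specify the weights, and `tiered_trust` is a hypothetical name.

```python
from math import prod
from statistics import mean, median
from typing import List


def rat(scores: List[float], disagreement: float) -> float:
    """Robust Aggregate Trust with the council's 0.5 / 0.3 / 0.2 weights."""
    geometric_mean = prod(scores) ** (1 / len(scores))
    blended = 0.5 * geometric_mean + 0.3 * median(scores) + 0.2 * min(scores)
    return blended * (1 + 0.2 * (1 - disagreement))


def tiered_trust(level: str, scores: List[float], disagreement: float = 0.0) -> float:
    """Pick the aggregation rule by knowledge level (L0, L1, or L2)."""
    if not scores:
        raise ValueError("scores must not be empty")
    if level == "L0":
        return min(scores)   # strict weakest link
    if level == "L1":
        return mean(scores)  # assumption: unweighted mean lets weak links be outvoted
    if level == "L2":
        return rat(scores, disagreement)
    raise ValueError(f"unknown level: {level!r}")


per_model = [0.8, 0.8, 0.3]  # illustrative per-model WLNK scores
l0 = tiered_trust("L0", per_model)
l1 = tiered_trust("L1", per_model)
l2 = tiered_trust("L2", per_model, disagreement=0.5)
```

With these inputs the strict rule gives 0.3, while the mean and RAT both land near 0.63 to 0.65, which shows how the softer rules keep one weak model from dominating.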

3. Should Knowledge Levels (L0-L2) Map to Consensus Types?

Verdict: Yes—Strongest Part of the Proposal

| Level | Definition | Consensus Requirement | Validation Method |
|---|---|---|---|
| L0 | Data/Observation | Strict Unanimity (5/5) | RAG/API, not just voting |
| L1 | Patterns/Inference | Majority Vote (4/5+) | Explicit reasoning chains |
| L2 | Theories/Hypothesis | Plurality (3/5) | Preserve alternatives |

Key Insight: For L2, diversity is preferred over consensus—goal is generating options, not picking a winner prematurely.
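
The consensus requirements in the table can be expressed as minimum vote shares. The sketch below takes the 5/5, 4/5, and 3/5 figures from the table; applying the same fractions to councils of other sizes is an assumption, and `meets_consensus` is a hypothetical name.

```python
from fractions import Fraction

# Minimum share of agreeing models per level, from the table above.
REQUIRED_SHARE = {
    "L0": Fraction(5, 5),  # strict unanimity
    "L1": Fraction(4, 5),  # majority vote (4/5+)
    "L2": Fraction(3, 5),  # plurality
}


def meets_consensus(level: str, votes_for: int, council_size: int) -> bool:
    """Check whether the agreeing share reaches the level's requirement."""
    if council_size <= 0 or not 0 <= votes_for <= council_size:
        raise ValueError("votes_for must be between 0 and council_size")
    return Fraction(votes_for, council_size) >= REQUIRED_SHARE[level]
```

Exact fractions avoid floating-point edge cases at the boundary, for example 4 of 5 votes for L1.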

4. Risks Underestimated

| Risk | Severity | Council Mitigation |
|---|---|---|
| Latency Cascade | HIGH | Implement tiered invocation (not full council every query) |
| Attribution Collapse | HIGH | Tag every claim with model_id and consensus_score |
| Shared Hallucination | HIGH | Consensus ≠ Truth; add citation requirements |
| Cost Scaling | MEDIUM | 7× cost requires explicit thresholds |
| Prompt Injection Amplification | MEDIUM | Council may "launder" malicious output through consensus |

Council Architectural Recommendations

1. Tiered Invocation Strategy (Not Full Council by Default)

┌─────────────────────────────────────────────────────────┐
│                    Query Classifier                     │
│          (complexity, stakes, domain novelty)           │
└─────────────────┬───────────────────────────────────────┘
                  │
    ┌─────────────┼─────────────┬─────────────┐
    ▼             ▼             ▼             ▼
┌───────┐    ┌─────────┐   ┌─────────┐   ┌─────────┐
│ Fast  │    │ Verify  │   │  Full   │   │  Deep   │
│ Path  │    │  Path   │   │ Council │   │ Council │
│(1 LLM)│    │(2 LLMs) │   │(5 LLMs) │   │(5+synth)│
└───────┘    └─────────┘   └─────────┘   └─────────┘
   <2s          3-5s          8-12s         15-30s

Mapping to Quint Levels:

  • Fast Lane (L0): Single Model + RAG verification (facts)
  • Medium Lane (L1): 3-Model Vote (pattern validation)
  • Slow Lane (L2): Full Council + FPF scoring (hypothesis generation)
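
A minimal sketch of the query classifier's routing step. The four paths come from the diagram above. The 0-to-1 signal scale, the rule of routing on the highest signal, and the cut-offs at 0.25, 0.5, and 0.8 are all assumptions that would need tuning against real traffic.

```python
from dataclasses import dataclass
from enum import Enum


class Path(Enum):
    FAST = "fast"                  # 1 LLM
    VERIFY = "verify"              # 2 LLMs
    FULL_COUNCIL = "full_council"  # 5 LLMs
    DEEP_COUNCIL = "deep_council"  # 5 LLMs + synthesis


@dataclass(frozen=True)
class QuerySignals:
    complexity: float  # each signal is assumed to be scaled to 0..1
    stakes: float
    novelty: float


def choose_path(signals: QuerySignals) -> Path:
    """Route on the highest signal so one risky dimension is enough to escalate."""
    score = max(signals.complexity, signals.stakes, signals.novelty)
    if score < 0.25:
        return Path.FAST
    if score < 0.5:
        return Path.VERIFY
    if score < 0.8:
        return Path.FULL_COUNCIL
    return Path.DEEP_COUNCIL
```

Taking the maximum rather than the average is a deliberately conservative choice: a simple but high-stakes query still reaches the full council.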

2. Preserve Model Attribution in DRRs

design_rationale_record:
  query_id: "quint-hypothesis-001"
  timestamp: "2025-12-19T14:32:00Z"

  contributions:
    - model: "claude-opus-4.5"
      role: "primary_synthesis"
      claims: ["hypothesis_A", "constraint_check_passed"]
      confidence: 0.82

    - model: "gemini-3-pro"
      role: "adversarial_reviewer"
      dissents: ["edge_case_unhandled"]
      confidence: 0.71

  weakest_link_identified: "Assumption that clocks were synced"
  trust_score:
    fidelity: 0.78
    groundedness: 0.85
    robustness: 0.71
    composite_rat: 0.77
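
Attribution can be checked mechanically once every claim carries its source. The sketch below assumes a hypothetical `Claim` record with the `model_id` and `consensus_score` tags that the council asks for, and counts a claim as untraced when either tag is missing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    """Hypothetical claim record; field names follow the council's wording."""
    text: str
    model_id: Optional[str] = None
    consensus_score: Optional[float] = None


def untraced_claims_ratio(claims: List[Claim]) -> float:
    """Fraction of claims missing a model_id or a consensus_score."""
    if not claims:
        return 0.0
    untraced = sum(
        1 for c in claims if not c.model_id or c.consensus_score is None
    )
    return untraced / len(claims)


claims = [
    Claim("hypothesis_A", "claude-opus-4.5", 0.82),
    Claim("constraint_check_passed", "claude-opus-4.5", 0.82),
    Claim("edge_case_unhandled", "gemini-3-pro", 0.71),
    Claim("clocks were synced"),  # no attribution recorded
]
ratio = untraced_claims_ratio(claims)
```

In this example one of four claims is untraced, giving a ratio of 0.25.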

3. Circuit Breakers for Consensus Failure

class ConsensusCircuitBreaker:
    def evaluate(self, responses: List[ModelResponse], groundedness: float) -> Action:
        # groundedness is passed in explicitly: the G component of the
        # trust score computed for these responses

        # Irreconcilable disagreement → escalate
        if semantic_variance(responses) > DIVERGENCE_THRESHOLD:
            return Action.ESCALATE_TO_HUMAN

        # Suspicious unanimity (possible shared hallucination)
        if agreement_score(responses) > 0.98 and groundedness < 0.5:
            return Action.REQUEST_CITATIONS

        # Potential prompt injection pattern
        if anomaly_score(responses) > INJECTION_THRESHOLD:
            return Action.QUARANTINE_AND_REVIEW

        return Action.PROCEED_TO_SYNTHESIS

4. The "Fact-Rule" Split

  • L0 (Facts): Do NOT use LLMs to verify facts if possible. Use deterministic code/API or RAG lookups.
  • L1/L2 (Logic/Game): This is the sweet spot for the LLM Council.

Implementation Revision (Council-Informed)

| Phase | Original | Council Revision |
|---|---|---|
| Phase 1 | Evaluation | Prototype L1-only council (lowest risk) |
| Phase 2 | Council-Powered Abduction | Benchmark diversity with/without adversarial seeding |
| Phase 3 | Trust-Weighted Rankings | Replace WLNK with RAT for L1/L2 |
| Phase 4 | DRR Generation | Implement attribution schema before production |

Rollback Triggers (Council-Defined)

automatic_rollback:
  diversity:
    - semantic_variance < 0.3  # Hypotheses too similar
  latency:
    - p99_response_time > 15s
  trust:
    - shared_hallucination_detected: true
    - groundedness < 0.5 with consensus > 0.95
  attribution:
    - untraced_claims_ratio > 10%
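
The triggers above could be evaluated against a metrics snapshot roughly as follows. The thresholds are taken from the list. The `HealthMetrics` shape and the `rollback_reasons` name are assumptions, as is the idea that these metrics are already collected elsewhere.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HealthMetrics:
    """Hypothetical snapshot of the metrics the triggers refer to."""
    semantic_variance: float
    p99_response_time_s: float
    shared_hallucination_detected: bool
    groundedness: float
    consensus: float
    untraced_claims_ratio: float


def rollback_reasons(m: HealthMetrics) -> List[str]:
    """Return the trigger categories that fired; an empty list means healthy."""
    reasons = []
    if m.semantic_variance < 0.3:
        reasons.append("diversity")
    if m.p99_response_time_s > 15:
        reasons.append("latency")
    if m.shared_hallucination_detected or (
        m.groundedness < 0.5 and m.consensus > 0.95
    ):
        reasons.append("trust")
    if m.untraced_claims_ratio > 0.10:
        reasons.append("attribution")
    return reasons
```

Returning the list of fired categories, rather than a single boolean, keeps the reason for a rollback visible in logs.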

References