From False Positives to Precision: Weighted Duplicate Scoring
Published: 2025-01-16
Duplicate detection sounds simple: find things that are similar. But when your system uses multiple detection strategies, merging their results becomes the hard part. Our original implementation used a simple extend() pattern that created more problems than it solved. Here's how weighted scoring fixed it.
The Problem
Our duplicate detection system used three complementary strategies:
| Strategy | Strength | Weakness |
|---|---|---|
| Vector | Catches semantic similarity | Expensive (embedding generation) |
| Text | Catches near-exact wording | Misses paraphrases |
| Keyword | Fast topic matching | High false positive rate |
The original merge logic was straightforward:
```python
def _merge_results(main_results, new_results, method):
    """Just extend the lists."""
    for level in ["high_confidence", "medium_confidence", "low_confidence"]:
        seen_ids = {item["id"] for item in main_results.setdefault(level, [])}
        if level in new_results:
            for item in new_results[level]:
                if item["id"] not in seen_ids:
                    main_results[level].append(item)
```
The problem? This treats all strategies as an unconditional OR. If vector says 0.85 similarity and keyword says 0.60 similarity for the same item, which wins?
With extend(), both get added - or worse, only the first one encountered is kept. We lose the precision that comes from multiple strategies agreeing.
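To make the failure concrete, here is a minimal sketch (with hypothetical scores, not values from the real system) of what the first-one-wins merge does when two strategies score the same item differently:

```python
# Hypothetical hits for the same candidate from two different strategies.
vector_hit = {"id": "REQ-001", "score": 0.85}   # strong semantic match
keyword_hit = {"id": "REQ-001", "score": 0.60}  # weak topic match

merged = {"high_confidence": []}
seen_ids = set()

# The extend-style merge keeps only the first occurrence per id,
# silently discarding the second strategy's score.
for item in [vector_hit, keyword_hit]:
    if item["id"] not in seen_ids:
        merged["high_confidence"].append(item)
        seen_ids.add(item["id"])

print(merged["high_confidence"])  # only the 0.85 entry survives
```

The keyword score vanishes entirely, so there is no way to tell afterwards whether one strategy matched or three did.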
The LLM Council review caught this immediately:
"Simple extend treats all strategies as OR → 'High CPU on DB-01' and 'High CPU on DB-02' incorrectly merged as duplicates. Vector catches semantic similarity; fuzzy text catches similar hostnames → distinct incidents merged."
The verdict was REQUEST CHANGES. We needed weighted scoring.
The Solution
The fix required two conceptual shifts:
- Weighted aggregation instead of simple append
- Multi-strategy agreement as a confidence signal
Weighted Scoring Algorithm
We combine strategy scores with configurable weights:
```python
from dataclasses import dataclass

@dataclass
class WeightedScoringConfig:
    vector_weight: float = 0.5   # Semantic similarity is most reliable
    text_weight: float = 0.3     # Text matching is secondary
    keyword_weight: float = 0.2  # Keyword is least precise
    min_strategies_for_high: int = 2

def calculate_weighted_score(
    strategy_scores: dict,
    config: WeightedScoringConfig = WeightedScoringConfig(),
) -> float:
    """
    Score = (Vector * 0.5) + (Text * 0.3) + (Keyword * 0.2)
    Normalizes if some strategies are missing.
    """
    weights = {
        "vector": config.vector_weight,
        "text": config.text_weight,
        "keyword": config.keyword_weight,
    }
    total_weight = 0.0
    weighted_sum = 0.0
    for strategy, score in strategy_scores.items():
        if strategy in weights:
            weight = weights[strategy]
            weighted_sum += score * weight
            total_weight += weight
    # Normalize if not all strategies were used
    if total_weight > 0:
        return weighted_sum / total_weight
    return 0.0
```
This means if only vector and text detect a match, we normalize: (0.90 * 0.5 + 0.80 * 0.3) / (0.5 + 0.3) = 0.8625.
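You can check the normalization by hand. A standalone sketch, assuming the default weights from `WeightedScoringConfig` (vector 0.5, text 0.3, keyword 0.2):

```python
# Default strategy weights; keyword did not match, so it contributes nothing.
weights = {"vector": 0.5, "text": 0.3, "keyword": 0.2}
scores = {"vector": 0.90, "text": 0.80}

weighted_sum = sum(scores[s] * weights[s] for s in scores)   # 0.45 + 0.24 = 0.69
total_weight = sum(weights[s] for s in scores)               # 0.5 + 0.3 = 0.8

print(round(weighted_sum / total_weight, 4))  # 0.8625
```

Normalizing by the weights actually used keeps scores comparable whether one strategy fired or all three did.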
Multi-Strategy Agreement
A single-strategy match should never produce high confidence; the risk of false positives is too great:
```python
def determine_confidence_level(result, min_strategies_for_high=2):
    """
    High confidence requires multiple strategies to agree.
    Single-strategy matches can only be medium or low.
    """
    strategies_matched = result.get("strategies_matched", 0)
    weighted_score = calculate_weighted_score(result.get("scores", {}))
    # Multi-strategy agreement required for high confidence
    if strategies_matched >= min_strategies_for_high and weighted_score >= 0.80:
        return "high"
    elif strategies_matched >= 1 and weighted_score >= 0.65:
        return "medium"
    else:
        return "low"
```
This simple change dramatically reduces false positives. Even a 0.95 vector similarity can't produce "high" confidence alone.
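A self-contained sketch of the gating logic makes the cap visible (`confidence` is an illustrative name, not a function from the codebase):

```python
def confidence(strategies_matched, weighted_score, min_strategies_for_high=2):
    # High requires both agreement across strategies AND a strong score.
    if strategies_matched >= min_strategies_for_high and weighted_score >= 0.80:
        return "high"
    if strategies_matched >= 1 and weighted_score >= 0.65:
        return "medium"
    return "low"

print(confidence(1, 0.95))  # "medium": one strategy alone is capped
print(confidence(2, 0.85))  # "high": two strategies agree with a strong score
print(confidence(1, 0.50))  # "low": weak single-strategy match
```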
Entity-Specific Thresholds
Different entity types need different sensitivity:
```python
from dataclasses import dataclass

@dataclass
class EntityThresholds:
    high_confidence: float
    medium_confidence: float
    low_confidence: float

ENTITY_THRESHOLDS = {
    "incidents": EntityThresholds(
        high_confidence=0.95,   # Strict - don't suppress real alarms
        medium_confidence=0.85,
        low_confidence=0.75,
    ),
    "runbooks": EntityThresholds(
        high_confidence=0.80,   # Loose - find helpful content
        medium_confidence=0.70,
        low_confidence=0.60,
    ),
    "requirements": EntityThresholds(
        high_confidence=0.85,   # Default - balanced
        medium_confidence=0.75,
        low_confidence=0.65,
    ),
}
```
For incidents, we want almost-exact matches only (0.95). False negatives are better than suppressing real alarms. For runbooks, we're more permissive (0.80) because finding helpful related content is the goal.
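The same weighted score can therefore land in different buckets depending on entity type. A standalone sketch (`bucket` is a hypothetical helper; the `EntityThresholds` dataclass is redefined here so the snippet runs on its own):

```python
from dataclasses import dataclass

@dataclass
class EntityThresholds:
    high_confidence: float
    medium_confidence: float
    low_confidence: float

def bucket(score: float, t: EntityThresholds) -> str:
    # Map a weighted score to a confidence level using the entity's thresholds.
    if score >= t.high_confidence:
        return "high"
    if score >= t.medium_confidence:
        return "medium"
    if score >= t.low_confidence:
        return "low"
    return "no_match"

incidents = EntityThresholds(0.95, 0.85, 0.75)
runbooks = EntityThresholds(0.80, 0.70, 0.60)

print(bucket(0.88, incidents))  # "medium": strict incident thresholds
print(bucket(0.88, runbooks))   # "high": looser runbook thresholds
```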
Cascading Execution
We also optimized for performance by running cheap strategies first:
```python
def get_strategy_execution_order() -> list[str]:
    return ["keyword", "text", "vector"]  # Cheapest first
```
Keyword matching is just string comparison - nearly free. Vector requires embedding generation or lookup - expensive. By running keyword first, we can potentially skip the expensive vector check if keyword already disqualifies the match.
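A minimal sketch of that cascade, with stub scorers and an illustrative cutoff (the function name, costs, and threshold are assumptions for the example, not values from the real system):

```python
def cascade_score(item, strategies, keyword_cutoff=0.2):
    """Run strategies cheapest-first; bail out when keyword rules the pair out."""
    scores = {}
    for name, score_fn in strategies:  # ordered cheapest -> most expensive
        scores[name] = score_fn(item)
        if name == "keyword" and scores[name] < keyword_cutoff:
            break  # no topical overlap at all: skip text and vector entirely
    return scores

strategies = [
    ("keyword", lambda item: 0.1),  # stub scorers standing in for real ones
    ("text", lambda item: 0.9),
    ("vector", lambda item: 0.9),
]

print(cascade_score({"id": "X"}, strategies))  # expensive checks never run
```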
Implementation
The updated _merge_results now tracks per-strategy scores:
```python
def _merge_results(main_results, new_results, method):
    """Merge with weighted scoring instead of simple extend."""
    # Initialize tracking dict
    if "_item_scores" not in main_results:
        main_results["_item_scores"] = {}
    for level in ["high_confidence", "medium_confidence", "low_confidence"]:
        for item in new_results.get(level, []):
            item_id = item["id"]
            # Track per-strategy scores for this item
            if item_id not in main_results["_item_scores"]:
                main_results["_item_scores"][item_id] = {
                    "item_data": item,
                    "strategy_scores": {},
                    "strategies_matched": 0,
                }
            # Add score for this strategy
            score = extract_score_by_method(item, method)
            main_results["_item_scores"][item_id]["strategy_scores"][method] = score
            main_results["_item_scores"][item_id]["strategies_matched"] += 1
    # Recategorize based on weighted scoring
    _recategorize_with_weighted_scoring(main_results)
```
The key insight: we don't decide the confidence level when processing each strategy. We wait until all strategies have contributed their scores, then calculate the weighted result.
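The recategorization pass itself might look like the following. This is a hypothetical sketch (`recategorize` and the inline weights are assumptions, not the actual `_recategorize_with_weighted_scoring` implementation), showing how per-item scores collected during the merge become final buckets:

```python
WEIGHTS = {"vector": 0.5, "text": 0.3, "keyword": 0.2}

def recategorize(item_scores, min_strategies_for_high=2):
    """Compute each item's weighted score and place it in a final bucket."""
    buckets = {"high_confidence": [], "medium_confidence": [], "low_confidence": []}
    for item_id, entry in item_scores.items():
        scores = entry["strategy_scores"]
        total = sum(WEIGHTS[s] for s in scores)
        weighted = sum(v * WEIGHTS[s] for s, v in scores.items()) / total
        matched = entry["strategies_matched"]
        if matched >= min_strategies_for_high and weighted >= 0.80:
            level = "high_confidence"
        elif weighted >= 0.65:
            level = "medium_confidence"
        else:
            level = "low_confidence"
        buckets[level].append({
            **entry["item_data"],
            "weighted_score": round(weighted, 4),
            "strategies_matched": matched,
            "strategy_scores": scores,
        })
    return buckets

item_scores = {
    "REQ-001": {
        "item_data": {"id": "REQ-001"},
        "strategy_scores": {"vector": 0.90, "text": 0.85},
        "strategies_matched": 2,
    }
}
result = recategorize(item_scores)
print(result["high_confidence"][0]["weighted_score"])
```

Two strategies agreed and the normalized score clears 0.80, so the item lands in the high-confidence bucket with its full strategy breakdown attached.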
Results
The weighted scoring approach provides:
| Metric | Before | After |
|---|---|---|
| False Positive Rate | High (~30%) | Low (~5%) |
| Multi-strategy matches | Not tracked | Prioritized |
| Entity-specific tuning | None | Full support |
| Debugging info | Lost | Preserved |
Each result now includes full strategy breakdown:
```json
{
    "id": "REQ-001",
    "weighted_score": 0.875,
    "confidence_level": "high",
    "strategies_matched": 2,
    "strategy_scores": {
        "vector": 0.90,
        "text": 0.85
    }
}
```
This makes debugging straightforward: you can see exactly which strategies contributed and with what scores.
Lessons Learned
- **Simple OR logic loses precision.** When merging results from multiple strategies, preserve the individual contributions.
- **Agreement is a signal.** Two strategies detecting the same match is much stronger evidence than one strategy with a high score.
- **Different entities need different thresholds.** What's "similar enough" for runbooks is too loose for incidents.
- **Preserve debugging info.** The per-strategy scores are invaluable for tuning thresholds and investigating false positives.
- **Order matters for performance.** Run cheap strategies first to enable early termination.
This post details the implementation of ADR-038: Duplicate Detection weighted scoring improvements.