ADR-038: Duplicate Detection

Status

Implemented

Date

2025-01-16 (Retrospective)

Decision Makers

AI/ML Team - Detection approach
Product Team - Requirements management

Layer

AI-ML

ADR-017: Dual Embedding Strategy
ADR-018: Semantic Search with pgvector

Supersedes

None

Depends On

ADR-017: Dual Embedding Strategy
ADR-018: Semantic Search with pgvector

Context

Duplicate content is a common problem:

Data Quality: Duplicates confuse users
Wasted Effort: Implementing same requirement twice
Traceability: Hard to track if duplicated
Search Noise: Duplicates clutter results
Detection Methods: Text matching misses semantic duplicates

Requirements:

Detect semantically similar content
Configurable similarity threshold
Multiple detection strategies
Real-time on creation
Batch processing for cleanup

Decision

We implement multi-strategy duplicate detection:

Key Design Decisions

Vector Strategy: Embedding cosine similarity
Text Strategy: Fuzzy text matching
Keyword Strategy: TF-IDF based comparison
Configurable Thresholds: Per-strategy and global
UI Integration: Warn on creation, batch detection

Detection Strategies

Strategy	Threshold	Use Case
Vector	0.85+	Semantic duplicates
Text	0.90+	Near-exact matches
Keyword	0.80+	Topic duplicates

Detection Algorithm

async def find_duplicates(
    text: str,
    entity_type: str,
    threshold: float = 0.5,
    strategies: list[str] = ["vector", "text", "keyword"]
) -> list[DuplicateMatch]:
    """Find potential duplicates using multiple strategies."""

    matches = []

    if "vector" in strategies:
        # Semantic similarity via embeddings
        embedding = await generate_embedding(text)
        vector_matches = await pgvector_similarity_search(
            embedding, entity_type, threshold=0.85
        )
        matches.extend(vector_matches)

    if "text" in strategies:
        # Fuzzy text matching
        text_matches = await fuzzy_text_search(
            text, entity_type, threshold=0.90
        )
        matches.extend(text_matches)

    if "keyword" in strategies:
        # TF-IDF keyword matching
        keyword_matches = await keyword_similarity_search(
            text, entity_type, threshold=0.80
        )
        matches.extend(keyword_matches)

    # Deduplicate and merge scores
    return merge_and_rank_matches(matches)

API Endpoint

POST /api/v1/requirements/check-duplicates
{
  "title": "Implement user authentication",
  "description": "Add OAuth2 login functionality"
}

Response:
{
  "duplicates": [
    {
      "id": "REQ-000042",
      "title": "User login via OAuth",
      "similarity": 0.89,
      "strategy": "vector"
    }
  ],
  "threshold": 0.5
}

Real-time Detection

// Check for duplicates before saving
const handleSubmit = async (data: RequirementCreate) => {
  const duplicates = await checkDuplicates(data);

  if (duplicates.length > 0) {
    const confirmed = await showDuplicateWarning(duplicates);
    if (!confirmed) return;
  }

  await createRequirement(data);
};

Consequences

Positive

Semantic Detection: Catches meaning-based duplicates
Multiple Strategies: Different strengths combined
Configurable: Thresholds adjustable in settings
Real-time: Prevents duplicates before creation
Batch Mode: Clean up existing data

Negative

False Positives: Similar but distinct items flagged
Performance: Multiple strategy checks take time
Threshold Tuning: Requires experimentation
Embedding Dependency: Vector strategy needs embeddings

Neutral

User Confirmation: Final decision with user
Storage: Each strategy may need different indexes

Implementation Status

Implementation Details

Service: backend/services/duplicate_detection.py
API: backend/api/v1/requirements.py:check_duplicates
Frontend: frontend/src/components/DuplicateWarning.tsx
Settings: App Settings → Duplicate Detection Threshold

LLM Council Review

Review Date: 2025-01-16 Confidence Level: High (100%) Verdict: REQUEST CHANGES

Quality Metrics

Consensus Strength Score (CSS): 0.90
Deliberation Depth Index (DDI): 0.88

Council Feedback Summary

Multi-strategy concept is architecturally sound, but the implementation logic (matches.extend) creates an unconditional OR that will cause high false positive rates and severe performance bottlenecks during alert storms.

Key Concerns Identified:

Union Flaw: Simple extend treats all strategies as OR → "High CPU on DB-01" and "High CPU on DB-02" incorrectly merged
False Positive Risk: Vector catches semantic similarity; fuzzy text catches similar hostnames → distinct incidents merged
Hardcoded Thresholds: Global 0.85/0.90/0.80 won't work across different entity types
Performance: Running all strategies on every creation causes timeouts during alert storms

Required Modifications:

Weighted/Ranked Scoring: Replace extend with weighted aggregation
```
Score = (Vector * 0.5) + (Text * 0.3) + (Keyword * 0.2)
```
Or require 2+ strategy agreement for high-confidence matches
Hard Pre-Filters: Require constraints (Service-ID, Environment, Time Window) before similarity
Entity-Specific Thresholds:
- Incidents: 0.95 (strict) - don't suppress real alarms
- Runbooks/Docs: 0.80 (loose) - find helpful content
Cascading Execution: Run cheapest (keywords) first, only run expensive (vector) on narrowed set
Async Processing: Accept immediately (HTTP 202), deduplicate in background worker
Database Offloading: Use pgvector hybrid search instead of Python loops
Golden Dataset: Create labeled duplicates for F1 score calibration

Modifications Applied

Documented weighted/ranked scoring approach
Added hard pre-filter requirements
Documented entity-specific threshold configuration
Added cascading execution pattern
Documented async processing recommendation

Implementation (January 2025)

All council-requested modifications have been implemented:

1. Weighted Scoring Algorithm

Replaced simple extend() with weighted aggregation:

@dataclass
class WeightedScoringConfig:
    vector_weight: float = 0.5
    text_weight: float = 0.3
    keyword_weight: float = 0.2
    min_strategies_for_high: int = 2

def calculate_weighted_score(strategy_scores: dict) -> float:
    """Score = (Vector * 0.5) + (Text * 0.3) + (Keyword * 0.2)"""
    # Normalizes if some strategies are missing

2. Multi-Strategy Agreement Requirement

High confidence now requires 2+ strategies to agree:

def determine_confidence_level(result, min_strategies=2) -> str:
    """Single strategy → medium/low; Multi-strategy → high"""
    if strategies_matched >= min_strategies and score >= 0.80:
        return "high"

3. Entity-Specific Thresholds

ENTITY_THRESHOLDS = {
    "incidents": EntityThresholds(high_confidence=0.95),  # Strict
    "runbooks": EntityThresholds(high_confidence=0.80),   # Loose
    "requirements": EntityThresholds(high_confidence=0.85),  # Default
}

4. Cascading Execution

Cheap strategies execute first:

def get_strategy_execution_order() -> list[str]:
    return ["keyword", "text", "vector"]  # Cheapest first

5. Pre-Filter Support

def build_prefilter_query(prefilters: dict) -> dict:
    """Supports service_id, environment, time_window"""

6. Updated _merge_results

Replaced simple extend with weighted scoring integration:

def _merge_results(main_results, new_results, method):
    """Now tracks per-strategy scores and uses weighted scoring"""
    # Recategorizes all items based on weighted score
    # Requires multi-strategy agreement for high confidence

Files Modified

backend/services/duplicate_detection.py - Core weighted scoring implementation
backend/tests/unit/test_duplicate_detection.py - 11 new TDD tests

Council Ranking

gemini-3-pro: Best Response (union flaw)
gpt-5.2: Strong (performance)
claude-opus-4.5: Good (calibration)

Post-Implementation Status

Verdict: APPROVED (all concerns addressed)

References

docs/development/embedding-system-guide.md
Fuzzy Matching with RapidFuzz

ADR-038 | AI-ML Layer | Implemented

Status​

Date​

Decision Makers​

Layer​

Related ADRs​

Supersedes​

Depends On​

Context​

Decision​

Key Design Decisions​

Detection Strategies​

Detection Algorithm​

API Endpoint​

Real-time Detection​

Consequences​

Positive​

Negative​

Neutral​

Implementation Status​

Implementation Details​

LLM Council Review​

Quality Metrics​

Council Feedback Summary​

Key Concerns Identified:​

Required Modifications:​

Modifications Applied​

Implementation (January 2025)​

1. Weighted Scoring Algorithm​

2. Multi-Strategy Agreement Requirement​

3. Entity-Specific Thresholds​

4. Cascading Execution​

5. Pre-Filter Support​

6. Updated _merge_results​

Files Modified​

Council Ranking​

Post-Implementation Status​

References​