ADR-038: Duplicate Detection
Status
Implemented
Date
2025-01-16 (Retrospective)
Decision Makers
- AI/ML Team - Detection approach
- Product Team - Requirements management
Layer
AI-ML
Related ADRs
- ADR-017: Dual Embedding Strategy
- ADR-018: Semantic Search with pgvector
Supersedes
None
Depends On
- ADR-017: Dual Embedding Strategy
- ADR-018: Semantic Search with pgvector
Context
Duplicate content is a common problem:
- Data Quality: Duplicates confuse users
- Wasted Effort: Implementing same requirement twice
- Traceability: Hard to track if duplicated
- Search Noise: Duplicates clutter results
- Detection Methods: Text matching misses semantic duplicates
Requirements:
- Detect semantically similar content
- Configurable similarity threshold
- Multiple detection strategies
- Real-time on creation
- Batch processing for cleanup
Decision
We implement multi-strategy duplicate detection:
Key Design Decisions
- Vector Strategy: Embedding cosine similarity
- Text Strategy: Fuzzy text matching
- Keyword Strategy: TF-IDF based comparison
- Configurable Thresholds: Per-strategy and global
- UI Integration: Warn on creation, batch detection
Detection Strategies
| Strategy | Threshold | Use Case |
|---|---|---|
| Vector | 0.85+ | Semantic duplicates |
| Text | 0.90+ | Near-exact matches |
| Keyword | 0.80+ | Topic duplicates |
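The three strategy scores can be sketched with stdlib stand-ins (`SequenceMatcher` standing in for RapidFuzz-style fuzzy matching, token Jaccard standing in for the TF-IDF comparison; these helpers are illustrative, not the production scorers):

```python
import math
from difflib import SequenceMatcher

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Vector strategy: cosine similarity between two embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def text_similarity(a: str, b: str) -> float:
    """Text strategy: fuzzy ratio (stdlib stand-in for RapidFuzz)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def keyword_similarity(a: str, b: str) -> float:
    """Keyword strategy: token-set Jaccard as a TF-IDF stand-in."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0
```

Each scorer returns a value in [0, 1], which is what makes the per-strategy thresholds in the table directly comparable.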
Detection Algorithm
```python
async def find_duplicates(
    text: str,
    entity_type: str,
    threshold: float = 0.5,
    strategies: list[str] | None = None,
) -> list[DuplicateMatch]:
    """Find potential duplicates using multiple strategies."""
    # Avoid a mutable default argument; fall back to all strategies.
    strategies = strategies or ["vector", "text", "keyword"]
    matches = []
    if "vector" in strategies:
        # Semantic similarity via embeddings
        embedding = await generate_embedding(text)
        vector_matches = await pgvector_similarity_search(
            embedding, entity_type, threshold=0.85
        )
        matches.extend(vector_matches)
    if "text" in strategies:
        # Fuzzy text matching
        text_matches = await fuzzy_text_search(
            text, entity_type, threshold=0.90
        )
        matches.extend(text_matches)
    if "keyword" in strategies:
        # TF-IDF keyword matching
        keyword_matches = await keyword_similarity_search(
            text, entity_type, threshold=0.80
        )
        matches.extend(keyword_matches)
    # Deduplicate and merge scores
    return merge_and_rank_matches(matches)
```
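`merge_and_rank_matches` is referenced above but not shown; a minimal sketch, assuming each match is a dict carrying `id`, `strategy`, and `similarity` keys:

```python
def merge_and_rank_matches(matches: list[dict]) -> list[dict]:
    """Merge per-strategy hits on the same entity and rank by best score."""
    by_id: dict[str, dict] = {}
    for m in matches:
        entry = by_id.setdefault(
            m["id"], {"id": m["id"], "strategies": {}, "similarity": 0.0}
        )
        # Remember each strategy's score; rank by the best one seen.
        entry["strategies"][m["strategy"]] = m["similarity"]
        entry["similarity"] = max(entry["similarity"], m["similarity"])
    return sorted(by_id.values(), key=lambda e: e["similarity"], reverse=True)
```

Keeping the per-strategy scores around (rather than only the maximum) is what later allows weighted aggregation and multi-strategy agreement checks.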
API Endpoint
POST /api/v1/requirements/check-duplicates

```json
{
  "title": "Implement user authentication",
  "description": "Add OAuth2 login functionality"
}
```
Response:
```json
{
  "duplicates": [
    {
      "id": "REQ-000042",
      "title": "User login via OAuth",
      "similarity": 0.89,
      "strategy": "vector"
    }
  ],
  "threshold": 0.5
}
```
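A caller will usually post-filter the response before bothering the user; a minimal sketch assuming the JSON above has already been parsed into a dict (`strong_duplicates` is an illustrative helper, not part of the API):

```python
def strong_duplicates(response: dict, min_similarity: float = 0.85) -> list[dict]:
    """Keep only matches strong enough to warrant a warning dialog."""
    return [d for d in response["duplicates"] if d["similarity"] >= min_similarity]
```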
Real-time Detection
```typescript
// Check for duplicates before saving
const handleSubmit = async (data: RequirementCreate) => {
  const duplicates = await checkDuplicates(data);
  if (duplicates.length > 0) {
    const confirmed = await showDuplicateWarning(duplicates);
    if (!confirmed) return;
  }
  await createRequirement(data);
};
```
Consequences
Positive
- Semantic Detection: Catches meaning-based duplicates
- Multiple Strategies: Different strengths combined
- Configurable: Thresholds adjustable in settings
- Real-time: Prevents duplicates before creation
- Batch Mode: Clean up existing data
Negative
- False Positives: Similar but distinct items flagged
- Performance: Multiple strategy checks take time
- Threshold Tuning: Requires experimentation
- Embedding Dependency: Vector strategy needs embeddings
Neutral
- User Confirmation: Final decision with user
- Storage: Each strategy may need different indexes
Implementation Status
- Core implementation complete
- Tests written and passing
- Documentation updated
- Migration/upgrade path defined
- Monitoring/observability in place
Implementation Details
- Service: `backend/services/duplicate_detection.py`
- API: `backend/api/v1/requirements.py:check_duplicates`
- Frontend: `frontend/src/components/DuplicateWarning.tsx`
- Settings: App Settings → Duplicate Detection Threshold
LLM Council Review
Review Date: 2025-01-16
Confidence Level: High (100%)
Verdict: REQUEST CHANGES
Quality Metrics
- Consensus Strength Score (CSS): 0.90
- Deliberation Depth Index (DDI): 0.88
Council Feedback Summary
Multi-strategy concept is architecturally sound, but the implementation logic (matches.extend) creates an unconditional OR that will cause high false positive rates and severe performance bottlenecks during alert storms.
Key Concerns Identified:
- Union Flaw: Simple extend treats all strategies as OR → "High CPU on DB-01" and "High CPU on DB-02" incorrectly merged
- False Positive Risk: Vector catches semantic similarity; fuzzy text catches similar hostnames → distinct incidents merged
- Hardcoded Thresholds: Global 0.85/0.90/0.80 won't work across different entity types
- Performance: Running all strategies on every creation causes timeouts during alert storms
Required Modifications:
- Weighted/Ranked Scoring: Replace extend with weighted aggregation (Score = Vector × 0.5 + Text × 0.3 + Keyword × 0.2), or require 2+ strategy agreement for high-confidence matches
- Hard Pre-Filters: Require constraints (Service-ID, Environment, Time Window) before similarity comparison
- Entity-Specific Thresholds:
- Incidents: 0.95 (strict) - don't suppress real alarms
- Runbooks/Docs: 0.80 (loose) - find helpful content
- Cascading Execution: Run cheapest (keywords) first, only run expensive (vector) on narrowed set
- Async Processing: Accept immediately (HTTP 202), deduplicate in background worker
- Database Offloading: Use pgvector hybrid search instead of Python loops
- Golden Dataset: Create labeled duplicates for F1 score calibration
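The F1 calibration the council asks for can be computed as follows, assuming the golden dataset is a set of labeled duplicate pairs (`f1_score` is an illustrative helper, not part of the codebase):

```python
def f1_score(predicted: set, labeled: set) -> float:
    """F1 of detected duplicate pairs against a golden labeled set."""
    if not predicted or not labeled:
        return 0.0
    tp = len(predicted & labeled)          # correctly detected duplicates
    precision = tp / len(predicted)        # how many detections were real
    recall = tp / len(labeled)             # how many real duplicates were found
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

Sweeping the detection threshold and picking the value that maximizes this score is the usual way to tune per-entity thresholds empirically.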
Modifications Applied
- Documented weighted/ranked scoring approach
- Added hard pre-filter requirements
- Documented entity-specific threshold configuration
- Added cascading execution pattern
- Documented async processing recommendation
Implementation (January 2025)
All council-requested modifications have been implemented:
1. Weighted Scoring Algorithm
Replaced simple extend() with weighted aggregation:
```python
@dataclass
class WeightedScoringConfig:
    vector_weight: float = 0.5
    text_weight: float = 0.3
    keyword_weight: float = 0.2
    min_strategies_for_high: int = 2

def calculate_weighted_score(strategy_scores: dict) -> float:
    """Score = (Vector * 0.5) + (Text * 0.3) + (Keyword * 0.2)"""
    # Normalizes if some strategies are missing
```
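The normalization step can be sketched as follows (a filled-in illustrative version using the weights from `WeightedScoringConfig`, not the exact production code):

```python
def calculate_weighted_score(strategy_scores: dict[str, float]) -> float:
    """Weighted average of available strategy scores, renormalized so that
    a missing strategy does not drag the score toward zero."""
    weights = {"vector": 0.5, "text": 0.3, "keyword": 0.2}
    total_weight = sum(weights[s] for s in strategy_scores)
    if total_weight == 0:
        return 0.0
    return sum(weights[s] * score for s, score in strategy_scores.items()) / total_weight
```

Renormalizing by the weights of the strategies that actually ran means a vector-only match of 0.8 still scores 0.8, instead of being penalized for the strategies that were skipped.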
2. Multi-Strategy Agreement Requirement
High confidence now requires 2+ strategies to agree:
```python
def determine_confidence_level(result, min_strategies: int = 2) -> str:
    """Single strategy → medium/low; multi-strategy agreement → high."""
    if result.strategies_matched >= min_strategies and result.score >= 0.80:
        return "high"
    ...
```
3. Entity-Specific Thresholds
```python
ENTITY_THRESHOLDS = {
    "incidents": EntityThresholds(high_confidence=0.95),     # Strict
    "runbooks": EntityThresholds(high_confidence=0.80),      # Loose
    "requirements": EntityThresholds(high_confidence=0.85),  # Default
}
```
4. Cascading Execution
Cheap strategies execute first:
```python
def get_strategy_execution_order() -> list[str]:
    return ["keyword", "text", "vector"]  # Cheapest first
```
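The cascade itself can be sketched like this, with an inline keyword scorer as a stand-in for the real strategies (`cascade` and its candidate shape are illustrative assumptions, not the actual service code):

```python
def cascade(text: str, candidates: list[dict], cutoff: float = 0.3) -> list[dict]:
    """Cascading execution sketch: a cheap keyword pre-pass prunes the
    candidate set before the expensive fuzzy-text and vector stages run."""
    def keyword_score(a: str, b: str) -> float:
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

    # Stage 1: cheap keyword overlap over every candidate.
    narrowed = [c for c in candidates if keyword_score(text, c["text"]) >= cutoff]
    # Stages 2-3: only this narrowed set would go on to fuzzy-text
    # matching and pgvector similarity search.
    return narrowed
```

During an alert storm this keeps the expensive embedding and vector-search work proportional to the handful of plausible candidates rather than the whole table.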
5. Pre-Filter Support
```python
def build_prefilter_query(prefilters: dict) -> dict:
    """Supports service_id, environment, time_window"""
```
6. Updated _merge_results
Replaced simple extend with weighted scoring integration:
```python
def _merge_results(main_results, new_results, method):
    """Now tracks per-strategy scores and uses weighted scoring"""
    # Recategorizes all items based on weighted score
    # Requires multi-strategy agreement for high confidence
```
Files Modified
- `backend/services/duplicate_detection.py` - Core weighted scoring implementation
- `backend/tests/unit/test_duplicate_detection.py` - 11 new TDD tests
Council Ranking
- gemini-3-pro: Best Response (union flaw)
- gpt-5.2: Strong (performance)
- claude-opus-4.5: Good (calibration)
Post-Implementation Status
Verdict: APPROVED (all concerns addressed)
References
- `docs/development/embedding-system-guide.md`
- Fuzzy Matching with RapidFuzz
ADR-038 | AI-ML Layer | Implemented