ADR-037: AI Categorization

Status

Implemented

Date

2025-01-16 (Retrospective)

Decision Makers

  • AI/ML Team - Classification approach
  • Product Team - Categorization requirements

Layer

AI-ML

Related

  • ADR-017: Dual Embedding Strategy
  • ADR-018: Semantic Search with pgvector

Supersedes

None

Depends On

  • ADR-017: Dual Embedding Strategy

Context

Manual categorization is tedious and inconsistent:

  1. Volume: Hundreds of requirements to categorize
  2. Consistency: Different users categorize differently
  3. Speed: Manual review is time-consuming
  4. Accuracy: Users may miss appropriate categories
  5. Evolution: Categories change over time

Requirements:

  • Auto-suggest categories based on content
  • Confidence scores for suggestions
  • Human override capability
  • Support for requirements, runbooks, incidents
  • Batch processing for existing entities

Decision

We implement embedding-based AI categorization:

Key Design Decisions

  1. Embedding Similarity: Compare to category exemplars
  2. Confidence Scoring: 0-1 confidence for each suggestion
  3. Multi-Label: Multiple categories per entity
  4. Human Override: Final decision with user
  5. Feedback Loop: Corrections improve suggestions

Categorization Algorithm

async def suggest_categories(
    text: str,
    entity_type: str,
    top_k: int = 3,
) -> list[CategorySuggestion]:
    """Suggest categories based on content similarity."""
    # Get embedding for input text
    embedding = await generate_embedding(text)

    # Get category exemplars
    categories = get_categories_for_entity(entity_type)
    suggestions = []

    for category in categories:
        # Compare to category exemplar embeddings
        similarity = cosine_similarity(embedding, category.exemplar_embedding)

        if similarity > 0.5:  # Threshold
            suggestions.append(CategorySuggestion(
                category=category.name,
                confidence=similarity,
                reasoning=f"Similar to {category.example_title}",
            ))

    return sorted(suggestions, key=lambda s: s.confidence, reverse=True)[:top_k]
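The snippet above assumes a `CategorySuggestion` type and a `cosine_similarity` helper that are defined elsewhere in the service. A minimal sketch of those supporting pieces (names match the snippet; the implementations here are illustrative, not the actual service code) could look like:

```python
import math
from dataclasses import dataclass


@dataclass
class CategorySuggestion:
    category: str
    confidence: float
    reasoning: str


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # degenerate embedding: treat as no similarity
    return dot / (norm_a * norm_b)
```

In production the similarity computation would typically be pushed into pgvector (per ADR-018) rather than computed in Python.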

API Endpoint

POST /api/v1/requirements/categorize
{
  "text": "Implement OAuth2 login with MFA support"
}

Response:
{
  "suggestions": [
    {"category": "Security", "confidence": 0.89},
    {"category": "Authentication", "confidence": 0.85},
    {"category": "User Management", "confidence": 0.72}
  ]
}

Frontend Integration

// Auto-categorize on blur
<TextField
  label="Description"
  onBlur={async (e) => {
    if (e.target.value.length > 50) {
      const suggestions = await categorizationService.suggest(e.target.value);
      if (suggestions[0]?.confidence > 0.8) {
        setCategory(suggestions[0].category);
      }
    }
  }}
/>

Consequences

Positive

  • Consistency: Same logic for all categorization
  • Speed: Instant suggestions
  • Accuracy: Embedding-based matching is semantic
  • Learning: Feedback improves over time
  • User Experience: Reduces cognitive load

Negative

  • Cold Start: Needs exemplars for new categories
  • Confidence Calibration: Scores may not reflect reality
  • Edge Cases: Novel content may not match well
  • Embedding Cost: Requires embedding generation

Neutral

  • Human-in-Loop: Still requires user confirmation
  • Batch Processing: Can process existing data

Implementation Status

  • Core implementation complete
  • Tests written and passing
  • Documentation updated
  • Migration/upgrade path defined
  • Monitoring/observability in place

Implementation Details

  • Service: backend/services/ai_categorization.py
  • API: backend/api/v1/categorization.py
  • Frontend: frontend/src/components/CategorySuggestions.tsx
  • Exemplars: backend/data/category_exemplars.json

LLM Council Review

Review Date: 2025-01-16
Confidence Level: High (100%)
Verdict: CONDITIONAL APPROVAL - MAJOR MODIFICATIONS

Quality Metrics

  • Consensus Strength Score (CSS): 0.88
  • Deliberation Depth Index (DDI): 0.90

Council Feedback Summary

Embedding-based categorization is the correct architectural direction for handling semantic variation. However, the single-exemplar approach is operationally brittle and the confidence scoring is miscalibrated.

Key Concerns Identified:

  1. Single Exemplar Fragile: One sentence can't represent category diversity (e.g., "Database" = timeout OR syntax error OR disk pressure)
  2. Linear Scan O(N): For loop over categories won't scale; needs vector index
  3. Confidence ≠ Similarity: Raw cosine similarity is not a probability and varies by model
  4. No Cold Start Strategy: New categories with no history have no exemplars
  5. Hybrid Search Missing: Embeddings fail on specific error codes/host IDs

Required Modifications:

  1. Centroid-Based Classification: Average embedding of multiple verified examples per category (not single exemplar)
  2. Vector Index: Use ANN/HNSW (Faiss, pgvector) instead of linear scan
  3. Confidence Calibration: Use Isotonic Regression or Platt Scaling to map similarity → probability
    • Monitor Expected Calibration Error (ECE)
    • Use dynamic thresholding (gap between top 1 and top 2)
  4. Hybrid Fallback: Combine vector search with keyword/BM25 for cold start and specific IDs
  5. Feedback Loop Implementation:
    • Store: {input_text, suggested_category, user_selection}
    • Update: Moving average to category centroid
    • Safety: Drift detection to prevent concept drift
  6. Entity Separation: Partition embedding space by entity type (Alerts vs Tickets vs Runbooks)
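Modifications 1 and 5 above can be sketched together: a category centroid is the mean of multiple verified example embeddings, and each user-confirmed correction nudges that centroid via a moving average. A minimal stdlib sketch (function names and the `alpha` smoothing factor are assumptions, not the shipped implementation):

```python
def centroid(embeddings: list[list[float]]) -> list[float]:
    """Mean embedding of multiple verified examples (modification 1).

    Replaces the single-exemplar comparison, so one category can cover
    diverse content (e.g. "Database" = timeout OR syntax error OR disk).
    """
    n = len(embeddings)
    dims = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / n for d in range(dims)]


def update_centroid(
    old: list[float], confirmed: list[float], alpha: float = 0.1
) -> list[float]:
    """Moving average toward a user-confirmed example (modification 5).

    A small alpha keeps each correction incremental, limiting the risk of
    concept drift that the council's safety note calls out.
    """
    return [(1 - alpha) * o + alpha * c for o, c in zip(old, confirmed)]


db_centroid = centroid([[1.0, 0.0], [0.0, 1.0]])  # → [0.5, 0.5]
db_centroid = update_centroid(db_centroid, [1.0, 0.0])
```

At scale the centroids would live in the pgvector/ANN index (modification 2) rather than in Python lists.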

Modifications Applied

  1. Documented centroid-based classification
  2. Added vector index requirement (ANN/HNSW)
  3. Documented confidence calibration approach
  4. Added hybrid search fallback strategy
  5. Documented feedback loop mechanism
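The dynamic thresholding called for in modification 3 (accept the top suggestion only when it clearly beats the runner-up) can be sketched as follows; the `min_score` and `min_gap` values are illustrative placeholders, not calibrated numbers:

```python
def accept_top_suggestion(
    scores: list[float], min_score: float = 0.5, min_gap: float = 0.1
) -> bool:
    """Accept the top category only when it clearly beats the runner-up.

    A high absolute score with a tiny gap to the second suggestion means
    the model is torn between categories, so we defer to the user.
    """
    if not scores:
        return False
    ranked = sorted(scores, reverse=True)
    if ranked[0] < min_score:
        return False  # nothing matches well enough
    gap = ranked[0] - (ranked[1] if len(ranked) > 1 else 0.0)
    return gap >= min_gap
```

For example, the Security/Authentication pair in the sample response (0.89 vs 0.85) would be deferred to the user, since the 0.04 gap is ambiguous. Mapping raw similarity to an actual probability (Isotonic Regression or Platt Scaling, monitored via ECE) would sit on top of this gate.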

Council Ranking

  • gpt-5.2: Best Response (calibration)
  • gemini-3-pro: Strong (hybrid search)
  • claude-opus-4.5: Good (feedback loop)

ADR-037 | AI-ML Layer | Implemented