
ADR-017: Dual Embedding Strategy

Status

Implemented

Date

2025-01-16 (Retrospective)

Decision Makers

  • AI/ML Team - Embedding strategy
  • Architecture Team - Cost optimization

Layer

AI-ML

Related ADRs

  • ADR-001: PostgreSQL with pgvector
  • ADR-018: Semantic Search with pgvector
  • ADR-038: Duplicate Detection
Supersedes

None

Depends On

  • ADR-001: PostgreSQL with pgvector

Context

The SRE Operations Platform uses embeddings for:

  1. Semantic Search: Find similar requirements and docs
  2. Duplicate Detection: Identify potential duplicate entries
  3. AI Categorization: Auto-classify content
  4. RAG Pipeline: Retrieve relevant context for Q&A

Key constraints:

  • OpenAI API has costs and rate limits
  • Offline development needs embedding support
  • Different quality/cost trade-offs
  • Must work with pgvector's dimension handling
  • Need graceful fallback chain

Decision

We implement a dual embedding strategy with automatic fallback:

Key Design Decisions

  1. Primary: OpenAI text-embedding-ada-002 (1536 dimensions)
  2. Secondary: Local sentence-transformers all-MiniLM-L6-v2 (384 dimensions)
  3. Fallback: A simple TF-IDF-like approach
  4. Automatic Selection: Based on API key availability
  5. Dimension Handling: pgvector supports both dimensions

Embedding Cascade

OpenAI API available?
├── Yes → Use text-embedding-ada-002 (1536d)
└── No → sentence-transformers available?
    ├── Yes → Use all-MiniLM-L6-v2 (384d)
    └── No → Use TF-IDF fallback

Configuration

# Environment
OPENAI_API_KEY=sk-...  # If available, use OpenAI

# Settings
embedding_settings:
  primary_model: "text-embedding-ada-002"
  secondary_model: "all-MiniLM-L6-v2"
  similarity_threshold: 0.5  # Duplicate detection
  search_threshold: 0.3      # Semantic search

Model Comparison

Model            Dimensions  Quality    Latency  Cost
OpenAI ada-002   1536        Excellent  ~200ms   $0.0001/1K tokens
MiniLM-L6-v2     384         Good       ~50ms    Free
TF-IDF           Variable    Basic      ~10ms    Free
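As a quick worked example of the cost column (using the ada-002 price quoted in this ADR; actual OpenAI pricing may have changed since):

```python
COST_PER_1K_TOKENS = 0.0001  # text-embedding-ada-002, per the table above

def monthly_cost_usd(tokens_per_day: int, days: int = 30) -> float:
    """Estimate monthly OpenAI embedding spend at a steady daily volume."""
    return tokens_per_day / 1000 * COST_PER_1K_TOKENS * days

# 1M tokens/day comes to about $3/month at this rate
```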

Storage Schema

-- pgvector column supports variable dimensions
ALTER TABLE requirements ADD COLUMN embedding vector;
ALTER TABLE requirements ADD COLUMN embedding_model VARCHAR(100);
ALTER TABLE requirements ADD COLUMN embedding_generated_at TIMESTAMP;

Consequences

Positive

  • Cost Control: Local model for development, OpenAI for production
  • Offline Support: Full functionality without internet
  • Graceful Degradation: System works with any model
  • Flexibility: Switch models without schema changes
  • Quality Options: Trade off quality vs cost

Negative

  • Inconsistent Results: Different models give different similarities
  • Reindexing: Changing models requires re-embedding
  • Complexity: Multiple code paths to maintain
  • Storage: Model name must be tracked per embedding

Neutral

  • Quality Difference: ~10-15% accuracy gap between OpenAI and local
  • Index Efficiency: pgvector handles both dimensions well

Alternatives Considered

1. OpenAI Only

  • Approach: Always use OpenAI API
  • Rejected: Offline development impossible, high costs

2. Local Only

  • Approach: Always use sentence-transformers
  • Rejected: Lower quality for production use

3. Dedicated Vector DB (Pinecone)

  • Approach: Separate service for embeddings
  • Rejected: Additional infrastructure, network latency

Implementation Status

  • Core implementation complete
  • Tests written and passing
  • Documentation updated
  • Migration/upgrade path defined
  • Monitoring/observability in place

Implementation Details

  • Embedding Service: backend/services/embeddings.py
  • Model Selection: backend/services/embeddings.py:get_embedding_model()
  • Configuration: backend/core/config.py
  • pgvector Setup: backend/migrations/
  • Docs: docs/development/embedding-models-explained.md

Compliance/Validation

  • Automated checks: Model consistency verified in tests
  • Manual review: Embedding quality reviewed periodically
  • Metrics: Model usage, embedding latency via Prometheus

LLM Council Review

Review Date: 2025-01-16
Confidence Level: High (95%)
Verdict: APPROVED_WITH_MODS

Quality Metrics

  • Consensus Strength Score (CSS): 0.90
  • Deliberation Depth Index (DDI): 0.85

Council Feedback Summary

The council identified a critical flaw: the "automatic fallback" strategy is mathematically invalid for vector search (comparing vectors from different embedding spaces).

Key Concerns Identified:

  1. Dimension Incompatibility: Cannot compare 1536-dim to 384-dim vectors in same search
  2. Automatic Fallback Danger: Silently mixing embeddings corrupts search index
  3. pgvector Constraint: Requires fixed dimensions per column

Required Modifications:

  1. Separate Columns for Dimensions:
    embedding_openai vector(1536)
    embedding_local vector(384)
  2. Explicit Mode Selection: Replace "automatic fallback" with configurable profiles
  3. Fail Loud in Production: Don't silently degrade, raise error if expected model unavailable
  4. Isolated Search Paths: Query only embeddings from matching model

Modifications Applied

  1. Schema supports separate embedding columns per dimension
  2. Environment-based model selection (production vs development)
  3. Added model filtering to all similarity queries
  4. TF-IDF fallback restricted to dev/test environments only
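Modification 4 (isolated search paths) amounts to picking the column for the active model and filtering rows by the stored model name. The query builder below is a hypothetical illustration, not the actual implementation; table and column names follow the schema discussed in this ADR:

```python
def build_similarity_query(model: str) -> str:
    """Return a pgvector cosine-distance query restricted to one model's embeddings."""
    # Each model searches only its own fixed-dimension column
    column = "embedding_openai" if model == "text-embedding-ada-002" else "embedding_local"
    return (
        f"SELECT id, 1 - ({column} <=> %(query_vec)s) AS similarity "
        f"FROM requirements "
        f"WHERE embedding_model = %(model)s AND {column} IS NOT NULL "
        f"ORDER BY {column} <=> %(query_vec)s "
        f"LIMIT 10"
    )
```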

Council Ranking

  • claude-opus-4.5: 0.833 (Best Response)
  • gpt-5.2: 0.778
  • gemini-3-pro: 0.5
  • grok-4.1: 0.0 (contained technical errors)

Operational Guidelines (APPROVED_WITH_MODS)

Cost Monitoring Dashboard

Metrics to Track:

Metric                            Description              Alert Threshold
openai_embedding_requests_total   Total API calls          >10K/day
openai_embedding_tokens_total     Tokens processed         >1M/day
openai_embedding_cost_usd         Estimated cost           >$50/day
embedding_latency_p99             99th percentile latency  >2s
local_model_fallback_rate         % using local model      >20%

Prometheus Metrics Implementation:

# backend/services/embedding_service.py
from prometheus_client import Counter, Histogram, Gauge

EMBEDDING_REQUESTS = Counter(
    'embedding_requests_total',
    'Total embedding requests',
    ['model', 'status']
)

EMBEDDING_TOKENS = Counter(
    'embedding_tokens_total',
    'Total tokens processed',
    ['model']
)

EMBEDDING_COST = Counter(
    'embedding_cost_usd_total',
    'Estimated embedding cost in USD',
    ['model']
)

EMBEDDING_LATENCY = Histogram(
    'embedding_latency_seconds',
    'Embedding generation latency',
    ['model'],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.0, 5.0]
)

# Cost calculation (OpenAI pricing as of 2024)
OPENAI_COST_PER_1K_TOKENS = 0.0001  # text-embedding-ada-002

def track_embedding_cost(tokens: int, model: str):
    if model == "openai":
        cost = (tokens / 1000) * OPENAI_COST_PER_1K_TOKENS
        EMBEDDING_COST.labels(model=model).inc(cost)

Grafana Dashboard Panels:

  1. Cost Overview: Daily/weekly/monthly spend
  2. Request Volume: Requests per hour by model
  3. Latency Distribution: P50/P95/P99 by model
  4. Fallback Rate: Local vs OpenAI usage
  5. Error Rate: Failed requests by error type

Fallback Triggers

Automatic Fallback Conditions:

# backend/services/embedding_service.py
from datetime import datetime, timedelta
from enum import Enum

class FallbackReason(Enum):
    API_ERROR = "api_error"
    RATE_LIMITED = "rate_limited"
    TIMEOUT = "timeout"
    COST_LIMIT = "cost_limit"
    NO_API_KEY = "no_api_key"
    MAINTENANCE = "maintenance"

class EmbeddingService:
    def __init__(self):
        self.daily_cost_limit = settings.embedding_daily_cost_limit  # e.g., $100
        self.daily_cost_tracker = 0.0
        self.fallback_until: datetime | None = None

    async def get_embedding(self, text: str) -> list[float]:
        # Check cost limit
        if self.daily_cost_tracker >= self.daily_cost_limit:
            return await self._local_embedding(text, FallbackReason.COST_LIMIT)

        # Check maintenance window
        if self.fallback_until and datetime.utcnow() < self.fallback_until:
            return await self._local_embedding(text, FallbackReason.MAINTENANCE)

        try:
            return await self._openai_embedding(text)
        except RateLimitError:
            # Back off from OpenAI for five minutes before retrying it
            self.fallback_until = datetime.utcnow() + timedelta(minutes=5)
            return await self._local_embedding(text, FallbackReason.RATE_LIMITED)
        except TimeoutError:
            return await self._local_embedding(text, FallbackReason.TIMEOUT)
        except Exception as e:
            logger.warning(f"OpenAI error, falling back to local: {e}")
            return await self._local_embedding(text, FallbackReason.API_ERROR)

Fallback Configuration:

# config/embedding.yaml
embedding:
  primary_model: openai
  fallback_model: sentence-transformers

  fallback_triggers:
    rate_limit:
      backoff_minutes: 5
      max_backoff_minutes: 60

    cost_limit:
      daily_limit_usd: 100
      monthly_limit_usd: 2000
      alert_threshold_pct: 80

    timeout:
      request_timeout_seconds: 10
      fallback_on_timeout: true

    error_rate:
      threshold_pct: 5
      window_minutes: 10
      fallback_duration_minutes: 15
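The error_rate trigger above amounts to a windowed rate check. A minimal sketch, where the function name and signature are assumptions rather than the platform's actual API:

```python
def should_fall_back(errors: int, total: int, threshold_pct: float = 5.0) -> bool:
    """True when the windowed error rate meets or exceeds threshold_pct."""
    if total == 0:
        return False  # no requests in the window, nothing to react to
    return errors / total * 100 >= threshold_pct
```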

Manual Fallback Control:

# API endpoint for manual control
@router.post("/admin/embeddings/fallback")
def set_fallback_mode(
    enabled: bool,
    reason: str,
    duration_minutes: int = 60,
):
    """Manually enable/disable fallback mode."""
    if enabled:
        embedding_service.fallback_until = datetime.utcnow() + timedelta(minutes=duration_minutes)
        embedding_service.fallback_reason = FallbackReason.MAINTENANCE
    else:
        embedding_service.fallback_until = None


ADR-017 | AI-ML Layer | Implemented | APPROVED_WITH_MODS | Completed