ADR-017: Dual Embedding Strategy
Status
Implemented
Date
2025-01-16 (Retrospective)
Decision Makers
- AI/ML Team - Embedding strategy
- Architecture Team - Cost optimization
Layer
AI-ML
Related ADRs
- ADR-001: PostgreSQL with pgvector
- ADR-018: Semantic Search with pgvector
- ADR-038: Duplicate Detection
Supersedes
None
Depends On
- ADR-001: PostgreSQL with pgvector
Context
The SRE Operations Platform uses embeddings for:
- Semantic Search: Find similar requirements and docs
- Duplicate Detection: Identify potential duplicate entries
- AI Categorization: Auto-classify content
- RAG Pipeline: Retrieve relevant context for Q&A
Key constraints:
- OpenAI API has costs and rate limits
- Offline development needs embedding support
- Different environments need different quality/cost trade-offs
- Must work with pgvector's dimension handling
- Need graceful fallback chain
Decision
We implement a dual embedding strategy with automatic fallback:
Key Design Decisions
- Primary: OpenAI text-embedding-ada-002 (1536 dimensions)
- Secondary: Local sentence-transformers all-MiniLM-L6-v2 (384 dimensions)
- Fallback: A simple TF-IDF-like approach (dev/test environments only)
- Automatic Selection: Based on API key availability
- Dimension Handling: Separate pgvector columns per embedding dimension (see LLM Council Review)
Embedding Cascade
OpenAI API available?
├── Yes → Use text-embedding-ada-002 (1536d)
└── No → sentence-transformers available?
    ├── Yes → Use all-MiniLM-L6-v2 (384d)
    └── No → Use TF-IDF fallback
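The cascade above can be sketched as a simple selection function. This is an illustrative sketch, not the actual `backend/services/embeddings.py` API; the function name and return shape are assumptions.

```python
import importlib.util
import os

def select_embedding_backend() -> tuple[str, int]:
    """Pick a backend per the cascade: OpenAI, then local model, then TF-IDF.

    Returns (backend_name, dimensions); -1 means the dimension varies.
    """
    # 1. OpenAI available? (keyed off the OPENAI_API_KEY environment variable)
    if os.environ.get("OPENAI_API_KEY"):
        return ("openai:text-embedding-ada-002", 1536)
    # 2. sentence-transformers installed locally?
    if importlib.util.find_spec("sentence_transformers") is not None:
        return ("local:all-MiniLM-L6-v2", 384)
    # 3. Last resort: TF-IDF (dimension depends on the vocabulary)
    return ("tfidf", -1)
```

In production the first branch should always win; per the council review, falling through silently there should be treated as an error rather than a degradation.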
Configuration
# Environment
OPENAI_API_KEY=sk-... # If available, use OpenAI
# Settings
embedding_settings:
  primary_model: "text-embedding-ada-002"
  secondary_model: "all-MiniLM-L6-v2"
  similarity_threshold: 0.5  # Duplicate detection
  search_threshold: 0.3      # Semantic search
Model Comparison
| Model | Dimensions | Quality | Latency | Cost |
|---|---|---|---|---|
| OpenAI ada-002 | 1536 | Excellent | ~200ms | $0.0001/1K tokens |
| MiniLM-L6-v2 | 384 | Good | ~50ms | Free |
| TF-IDF | Variable | Basic | ~10ms | Free |
Storage Schema
-- Separate typed columns per model: pgvector needs a fixed dimension per
-- column for indexing (see LLM Council Review below)
ALTER TABLE requirements ADD COLUMN embedding_openai vector(1536);
ALTER TABLE requirements ADD COLUMN embedding_local vector(384);
ALTER TABLE requirements ADD COLUMN embedding_model VARCHAR(100);
ALTER TABLE requirements ADD COLUMN embedding_generated_at TIMESTAMP;
Consequences
Positive
- Cost Control: Local model for development, OpenAI for production
- Offline Support: Full functionality without internet
- Graceful Degradation: System works with any model
- Flexibility: Switch models without schema changes
- Quality Options: Trade off quality vs cost
Negative
- Inconsistent Results: Different models give different similarities
- Reindexing: Changing models requires re-embedding
- Complexity: Multiple code paths to maintain
- Storage: Model name must be tracked per embedding
Neutral
- Quality Difference: ~10-15% accuracy gap between OpenAI and local
- Index Efficiency: pgvector handles both dimensions well
Alternatives Considered
1. OpenAI Only
- Approach: Always use OpenAI API
- Rejected: Offline development impossible, high costs
2. Local Only
- Approach: Always use sentence-transformers
- Rejected: Lower quality for production use
3. Dedicated Vector DB (Pinecone)
- Approach: Separate service for embeddings
- Rejected: Additional infrastructure, network latency
Implementation Status
- Core implementation complete
- Tests written and passing
- Documentation updated
- Migration/upgrade path defined
- Monitoring/observability in place
Implementation Details
- Embedding Service: backend/services/embeddings.py
- Model Selection: backend/services/embeddings.py:get_embedding_model()
- Configuration: backend/core/config.py
- pgvector Setup: backend/migrations/
- Docs: docs/development/embedding-models-explained.md
Compliance/Validation
- Automated checks: Model consistency verified in tests
- Manual review: Embedding quality reviewed periodically
- Metrics: Model usage, embedding latency via Prometheus
LLM Council Review
Review Date: 2025-01-16
Confidence Level: High (95%)
Verdict: APPROVED_WITH_MODS
Quality Metrics
- Consensus Strength Score (CSS): 0.90
- Deliberation Depth Index (DDI): 0.85
Council Feedback Summary
The council identified a critical flaw: the "automatic fallback" strategy is mathematically invalid for vector search (comparing vectors from different embedding spaces).
Key Concerns Identified:
- Dimension Incompatibility: Cannot compare 1536-dim to 384-dim vectors in same search
- Automatic Fallback Danger: Silently mixing embeddings corrupts search index
- pgvector Constraint: Requires fixed dimensions per column
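The dimension concern is not subtle: cosine similarity is undefined for vectors of different lengths, so an ada-002 embedding and a MiniLM embedding cannot be compared at all. A minimal, self-contained illustration (not platform code):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity; fails loud on a dimension mismatch."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

openai_vec = [0.1] * 1536  # stand-in for a text-embedding-ada-002 vector
local_vec = [0.1] * 384    # stand-in for an all-MiniLM-L6-v2 vector

try:
    cosine_similarity(openai_vec, local_vec)
except ValueError as e:
    print(e)  # dimension mismatch: 1536 vs 384
```

Even if the shorter vector were zero-padded to 1536 dimensions, the two models' embedding spaces are unrelated, so the resulting scores would be meaningless.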
Required Modifications:
- Separate Columns for Dimensions: embedding_openai vector(1536), embedding_local vector(384)
- Explicit Mode Selection: Replace "automatic fallback" with configurable profiles
- Fail Loud in Production: Don't silently degrade, raise error if expected model unavailable
- Isolated Search Paths: Query only embeddings from matching model
Modifications Applied
- Schema supports separate embedding columns per dimension
- Environment-based model selection (production vs development)
- Added model filtering to all similarity queries
- TF-IDF fallback restricted to dev/test environments only
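The model-filtering modification can be sketched as follows. Column names follow this ADR's revised schema, but the helper itself is hypothetical; the real queries live in backend/services/embeddings.py.

```python
# Hypothetical helper: route a similarity search to the column matching the
# model that embedded the query, so 1536-d and 384-d vectors never mix.
MODEL_COLUMNS = {
    "text-embedding-ada-002": "embedding_openai",  # vector(1536)
    "all-MiniLM-L6-v2": "embedding_local",         # vector(384)
}

def similarity_query(model: str, limit: int = 10) -> str:
    """Build a model-isolated pgvector similarity query (psycopg-style params)."""
    column = MODEL_COLUMNS[model]  # KeyError on an unknown model: fail loud
    return (
        f"SELECT id, 1 - ({column} <=> %(qvec)s::vector) AS similarity "
        f"FROM requirements "
        f"WHERE embedding_model = %(model)s AND {column} IS NOT NULL "
        f"ORDER BY {column} <=> %(qvec)s::vector "
        f"LIMIT {int(limit)}"
    )
```

The `WHERE embedding_model = …` filter plus the per-model column guarantees a query vector is only ever compared against embeddings from the same space.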
Council Ranking
- claude-opus-4.5: 0.833 (Best Response)
- gpt-5.2: 0.778
- gemini-3-pro: 0.5
- grok-4.1: 0.0 (contained technical errors)
Operational Guidelines (APPROVED_WITH_MODS)
Cost Monitoring Dashboard
Metrics to Track:
| Metric | Description | Alert Threshold |
|---|---|---|
| openai_embedding_requests_total | Total API calls | >10K/day |
| openai_embedding_tokens_total | Tokens processed | >1M/day |
| openai_embedding_cost_usd | Estimated cost | >$50/day |
| embedding_latency_p99 | 99th percentile latency | >2s |
| local_model_fallback_rate | % using local model | >20% |
Prometheus Metrics Implementation:
# backend/services/embedding_service.py
from prometheus_client import Counter, Histogram

EMBEDDING_REQUESTS = Counter(
    'embedding_requests_total',
    'Total embedding requests',
    ['model', 'status'],
)

EMBEDDING_TOKENS = Counter(
    'embedding_tokens_total',
    'Total tokens processed',
    ['model'],
)

EMBEDDING_COST = Counter(
    'embedding_cost_usd_total',
    'Estimated embedding cost in USD',
    ['model'],
)

EMBEDDING_LATENCY = Histogram(
    'embedding_latency_seconds',
    'Embedding generation latency',
    ['model'],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.0, 5.0],
)

# Cost calculation (OpenAI pricing as of 2024)
OPENAI_COST_PER_1K_TOKENS = 0.0001  # text-embedding-ada-002

def track_embedding_cost(tokens: int, model: str):
    if model == "openai":
        cost = (tokens / 1000) * OPENAI_COST_PER_1K_TOKENS
        EMBEDDING_COST.labels(model=model).inc(cost)
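As a quick sanity check on the pricing arithmetic above (a standalone snippet, not platform code):

```python
# Standalone check of the ada-002 pricing arithmetic
COST_PER_1K_TOKENS = 0.0001  # USD per 1K tokens, text-embedding-ada-002

def estimated_cost(tokens: int) -> float:
    """Estimated USD cost for embedding `tokens` tokens with ada-002."""
    return (tokens / 1000) * COST_PER_1K_TOKENS
```

At this rate 1M tokens costs about $0.10, so the $50/day alert threshold in the table above corresponds to roughly 500M tokens per day.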
Grafana Dashboard Panels:
- Cost Overview: Daily/weekly/monthly spend
- Request Volume: Requests per hour by model
- Latency Distribution: P50/P95/P99 by model
- Fallback Rate: Local vs OpenAI usage
- Error Rate: Failed requests by error type
Fallback Triggers
Automatic Fallback Conditions:
# backend/services/embedding_service.py
import logging
from datetime import datetime, timedelta
from enum import Enum

logger = logging.getLogger(__name__)

class FallbackReason(Enum):
    API_ERROR = "api_error"
    RATE_LIMITED = "rate_limited"
    TIMEOUT = "timeout"
    COST_LIMIT = "cost_limit"
    NO_API_KEY = "no_api_key"
    MAINTENANCE = "maintenance"

class EmbeddingService:
    def __init__(self):
        self.daily_cost_limit = settings.embedding_daily_cost_limit  # e.g., $100
        self.daily_cost_tracker = 0.0
        self.fallback_until: datetime | None = None

    async def get_embedding(self, text: str) -> list[float]:
        # Check cost limit
        if self.daily_cost_tracker >= self.daily_cost_limit:
            return await self._local_embedding(text, FallbackReason.COST_LIMIT)

        # Check maintenance window
        if self.fallback_until and datetime.utcnow() < self.fallback_until:
            return await self._local_embedding(text, FallbackReason.MAINTENANCE)

        try:
            return await self._openai_embedding(text)
        except RateLimitError:
            # Back off from OpenAI for a fixed window before retrying
            self.fallback_until = datetime.utcnow() + timedelta(minutes=5)
            return await self._local_embedding(text, FallbackReason.RATE_LIMITED)
        except TimeoutError:
            return await self._local_embedding(text, FallbackReason.TIMEOUT)
        except Exception as e:
            logger.warning(f"OpenAI error, falling back to local: {e}")
            return await self._local_embedding(text, FallbackReason.API_ERROR)
Fallback Configuration:
# config/embedding.yaml
embedding:
  primary_model: openai
  fallback_model: sentence-transformers
  fallback_triggers:
    rate_limit:
      backoff_minutes: 5
      max_backoff_minutes: 60
    cost_limit:
      daily_limit_usd: 100
      monthly_limit_usd: 2000
      alert_threshold_pct: 80
    timeout:
      request_timeout_seconds: 10
      fallback_on_timeout: true
    error_rate:
      threshold_pct: 5
      window_minutes: 10
      fallback_duration_minutes: 15
Manual Fallback Control:
# API endpoint for manual control
@router.post("/admin/embeddings/fallback")
def set_fallback_mode(
    enabled: bool,
    reason: str,
    duration_minutes: int = 60,
):
    """Manually enable/disable fallback mode."""
    if enabled:
        embedding_service.fallback_until = datetime.utcnow() + timedelta(minutes=duration_minutes)
        embedding_service.fallback_reason = FallbackReason.MAINTENANCE
    else:
        embedding_service.fallback_until = None
References
- OpenAI Embeddings
- OpenAI Pricing
- Sentence Transformers
- pgvector
- Docs: docs/development/ai-ml-features-guide.md
ADR-017 | AI-ML Layer | Implemented | APPROVED_WITH_MODS Completed