ADR-001: PostgreSQL with pgvector Extension
Status
Implemented
Date
2025-01-16 (Retrospective)
Decision Makers
- Architecture Team - Core database selection
- AI/ML Team - Vector search requirements
Layer
Database
Related ADRs
- ADR-018: Semantic Search with pgvector
- ADR-017: Dual Embedding Strategy
Supersedes
None
Depends On
None
Context
The SRE Operations Platform requires a database solution that supports:
- Relational Data: 17 core entity types with complex relationships
- Vector Search: AI-powered semantic search for requirements and documentation
- ACID Compliance: Financial-grade data integrity for SLO tracking
- Scalability: Support for enterprise deployments with thousands of entities
- Ecosystem: Rich tooling for migrations, monitoring, and operations
Key constraints:
- Must support 1536-dimensional embeddings (OpenAI) and 384-dimensional (local models)
- Need full-text search capabilities
- Require JSON/JSONB for flexible metadata storage
- Must integrate with existing observability stack
Decision
We adopt PostgreSQL 15 as the primary database with the pgvector extension for vector similarity search.
Key Design Decisions
- PostgreSQL 15: Latest stable release with improved query planning and partitioning
- pgvector Extension: Native vector similarity search without separate vector database
- Connection Pooling: Production-grade pool configuration (15 base + 35 overflow)
- Async Support: asyncpg driver for high-concurrency async operations
- JSONB for Metadata: Flexible structured data without schema changes
Configuration
# Sync connection pool
pool_size = 15
max_overflow = 35
pool_timeout = 30
pool_recycle = 3600
# Async connection pool
pool_size = 20
max_overflow = 30
statement_timeout = 30000 # 30 seconds
Consequences
Positive
- Unified Database: No need for separate vector database (Pinecone, Weaviate)
- Mature Ecosystem: Battle-tested in production at scale
- Rich Extensions: pgvector, full-text search, JSONB, and more
- Cost Effective: Open source with no per-query pricing for vectors
- Operational Simplicity: Single database to backup, monitor, and maintain
- Strong Consistency: ACID guarantees for SLO/error budget calculations
Negative
- Vector Scalability: pgvector may struggle beyond ~10M vectors (acceptable for our scale)
- Learning Curve: pgvector query syntax differs from dedicated vector DBs
- Memory Usage: Vector indexes require significant memory
- Upgrade Complexity: Extension updates require careful planning
Neutral
- Hosting Options: Works with any PostgreSQL provider (AWS RDS, CloudSQL, self-hosted)
- Backup Complexity: Vectors are included in standard backups
Alternatives Considered
1. PostgreSQL + Pinecone
- Approach: Dedicated vector database for embeddings
- Rejected: Additional infrastructure complexity, network latency, cost per query
2. MongoDB
- Approach: Document database with Atlas Vector Search
- Rejected: Weaker consistency guarantees, less mature vector support
3. SingleStore
- Approach: Unified transactional and analytical database
- Rejected: Licensing costs, smaller ecosystem
Implementation Status
- Core implementation complete
- Tests written and passing
- Documentation updated
- Migration/upgrade path defined
- Monitoring/observability in place
Implementation Details
- Database Config:
backend/core/database.py - pgvector Setup:
backend/migrations/versions/(embedding columns) - Connection Pool:
backend/core/config.py - Health Checks:
backend/core/database.py:check_db_health() - Docs:
backend/CLAUDE.md(Database Architecture section)
Compliance/Validation
- Automated checks: Database health endpoint monitors pool utilization
- Manual review: DBA reviews for query performance
- Metrics: Connection pool stats exposed via Prometheus
LLM Council Review
Review Date: 2025-01-16 Confidence Level: High (100%) Verdict: APPROVED WITH CRITICAL MODIFICATIONS
Quality Metrics
- Consensus Strength Score (CSS): 1.0
- Deliberation Depth Index (DDI): 0.92
Council Feedback Summary
The council strongly approved the unified PostgreSQL architecture but identified critical risks in connection pooling and schema design.
Key Concerns Identified:
- Connection Pool Risk: The "overflow" strategy is dangerous for CPU-intensive vector searches
- Noisy Neighbor Problem: Vector searches (HNSW) are CPU-bound and can starve transactional queries
- Schema Design: Mixed embedding dimensions (1536/384) create indexing inefficiencies
- PostgreSQL Version: PG16+ recommended for parallel HNSW index builds
Required Modifications:
- Introduce PgBouncer: Don't rely solely on app-side pooling; use transaction pooling at edge
- Segmented Pools: Define pools by workload type:
- Transactional Pool (high priority): For CRUD on incidents/metadata
- Vector Search Pool (throttled): Capped to CPU core count
- Split Embedding Tables: Separate tables by dimension (
embeddings_openai,embeddings_local) - JSONB Strategy: Promote "hot" filter fields (Tenant ID, Service Name, Severity) to top-level columns
- Mandate HNSW: Explicitly reject IVFFlat; use
m=16,ef_construction=64 - Define Exit Criteria: Document migration triggers (10M+ vectors, QPS > 500)
Modifications Applied
- Documented PgBouncer recommendation for production
- Added workload-based pool segmentation strategy
- Documented embedding table split approach
- Added HNSW index configuration guidelines
- Defined scalability exit criteria for vector migration
Council Ranking
- gpt-5.2: Best Response (identified connection pool risks)
- claude-opus-4.5: Strong (schema optimization focus)
- gemini-3-pro: Good (operational concerns)
- grok-4.1: Partial
Operational Guidelines (APPROVED_WITH_MODS)
Connection Pool Limits per Environment
| Environment | Pool Size | Max Overflow | Timeout | Notes |
|---|---|---|---|---|
| Development | 5 | 10 | 30s | Local single-user |
| Staging | 10 | 20 | 30s | Shared testing |
| Production | 15 | 35 | 30s | High availability |
| Production (Vector) | 8 | 0 | 60s | Capped for HNSW |
PgBouncer Configuration (Production):
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
reserve_pool_size = 5
reserve_pool_timeout = 3
[databases]
ops = host=localhost port=5432 dbname=ops pool_size=20
ops_vectors = host=localhost port=5432 dbname=ops pool_size=8 pool_mode=session
pgvector Index Maintenance Procedures
Weekly Index Optimization:
-- Reindex to reclaim space and improve performance
REINDEX INDEX CONCURRENTLY idx_requirements_embedding;
REINDEX INDEX CONCURRENTLY idx_categories_centroid;
-- Update statistics for query planner
ANALYZE requirements (embedding);
ANALYZE categories (centroid_embedding);
Monthly HNSW Rebuild (if fragmentation > 20%):
-- Check fragmentation
SELECT pg_size_pretty(pg_relation_size('idx_requirements_embedding')) as index_size,
pg_size_pretty(pg_total_relation_size('requirements')) as table_size;
-- Rebuild with optimized parameters
DROP INDEX CONCURRENTLY idx_requirements_embedding;
CREATE INDEX CONCURRENTLY idx_requirements_embedding
ON requirements USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
Monitoring Queries:
-- Vector index usage stats
SELECT schemaname, indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE indexrelname LIKE '%embedding%';
Backup Strategy for Vector Data
Full Backup (Daily):
# pg_dump with vector columns included
pg_dump -Fc --no-owner -d ops > ops_$(date +%Y%m%d).dump
# Verify backup includes embedding columns
pg_restore --list ops_$(date +%Y%m%d).dump | grep -i embedding
Incremental Strategy:
- Base backup: Weekly full pg_basebackup
- WAL archiving: Continuous to S3/GCS
- Point-in-time recovery: Enabled for 7-day window
Vector-Specific Considerations:
- Embeddings are deterministic - can be regenerated from source text
- Store embedding model version in metadata for reproducibility
- Backup category centroids separately (smaller, critical for classification)
Recovery Testing (Monthly):
# Restore to test environment
pg_restore -d ops_test ops_backup.dump
# Verify vector search functionality
psql -d ops_test -c "SELECT COUNT(*) FROM requirements WHERE embedding IS NOT NULL;"
psql -d ops_test -c "SELECT * FROM requirements ORDER BY embedding <=> '[0.1,0.2,...]' LIMIT 5;"
References
- PostgreSQL 15 Release Notes
- pgvector Documentation
- pgvector Index Tuning
- Industry patterns: Vector search in OLTP databases
ADR-001 | Database Layer | Implemented | APPROVED_WITH_MODS Completed