ADR-023: Prometheus + OpenTelemetry Metrics
Status
Implemented
Date
2025-01-16 (Retrospective)
Decision Makers
- SRE Team - Metrics strategy
- Architecture Team - Observability patterns
Layer
Observability
Related ADRs
- ADR-009: OpenTelemetry Instrumentation
- ADR-043: ClickStack Integration
Supersedes
None
Depends On
- ADR-009: OpenTelemetry Instrumentation
Context
The platform needs comprehensive metrics collection:
- Infrastructure Metrics: CPU, memory, connections
- Application Metrics: Latency, throughput, errors
- Business Metrics: Entity counts, operations
- SRE Metrics: SLIs, error budgets
- Alerting: Prometheus AlertManager integration
Key constraints:
- Prometheus is the alerting backend
- OpenTelemetry for distributed tracing
- Need both pull and push models
- Fixed cardinality to control costs
- Grafana for visualization
Decision
We implement dual metrics collection with Prometheus and OpenTelemetry:
Key Design Decisions
- Prometheus Pull: Infrastructure and high-frequency metrics
- OpenTelemetry Push: Business and custom metrics
- Fixed Cardinality: Controlled label sets
- Four Golden Signals: Latency, traffic, errors, saturation
- SRE Metrics: SLI/SLO tracking built-in
Metrics Architecture
Application
├── Prometheus Endpoint (/metrics)
│     → Prometheus Server
│     → AlertManager
│     → Grafana
│
└── OpenTelemetry SDK
      → OTLP Exporter
      → ClickStack/Collector
      → Grafana
Metric Types
| Metric | Type | Collection | Example |
|---|---|---|---|
| Request latency | Histogram | Prometheus | http_request_duration_seconds |
| Request count | Counter | Prometheus | http_requests_total |
| Active connections | Gauge | Prometheus | db_connections_active |
| Entity operations | Counter | OTEL | ops.entity.created |
| Error budget | Gauge | OTEL | ops.slo.error_budget_remaining |
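As a rough illustration of the last row, the value behind a gauge like ops.slo.error_budget_remaining can be derived from an SLO target and request counts. This is a sketch only: the function name, the 99.9% target, and the sample counts are assumptions for illustration, not values taken from the implementation.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the unspent fraction of the error budget, floored at 0.0.

    slo_target is the availability objective as a fraction (e.g. 0.999).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total
    spent = failed / allowed_failures
    return max(0.0, 1.0 - spent)


# Hypothetical numbers: 99.9% SLO, 1,000,000 requests, 250 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.2f}")  # 0.75
```

Because the result is a single unlabeled number, it adds no cardinality regardless of how many requests feed into it.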
Four Golden Signals
from prometheus_client import Counter, Gauge, Histogram

# Latency - request duration
request_latency = Histogram(
    'http_request_duration_seconds',
    'Request duration',
    ['method', 'endpoint', 'status']
)
# Traffic - request rate
request_count = Counter(
    'http_requests_total',
    'Total requests',
    ['method', 'endpoint']
)
# Errors - error rate
error_count = Counter(
    'http_errors_total',
    'Total errors',
    ['method', 'endpoint', 'error_type']
)
# Saturation - resource usage (no labels, so a single time series)
db_pool_utilization = Gauge(
    'db_pool_utilization_ratio',
    'Database pool utilization'
)
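One way these metrics might be recorded around a request is sketched below. It assumes the prometheus_client package; the observe_request helper, the private registry, and the sample route are hypothetical and are not taken from backend/core/metrics/prometheus_metrics.py.

```python
import time

from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

# A private registry keeps this sketch isolated from the process-wide default.
registry = CollectorRegistry()

request_latency = Histogram(
    "http_request_duration_seconds", "Request duration",
    ["method", "endpoint", "status"], registry=registry,
)
request_count = Counter(
    "http_requests_total", "Total requests",
    ["method", "endpoint"], registry=registry,
)
error_count = Counter(
    "http_errors_total", "Total errors",
    ["method", "endpoint", "error_type"], registry=registry,
)


def observe_request(method, route, handler):
    """Run handler() and record latency, traffic, and errors for one request.

    `route` should be a route template such as /api/orders/{id}, not a raw path.
    """
    start = time.perf_counter()
    status = "500"  # assumed status when the handler raises
    try:
        status = str(handler())
        return status
    except Exception as exc:
        error_count.labels(method=method, endpoint=route,
                           error_type=type(exc).__name__).inc()
        raise
    finally:
        request_count.labels(method=method, endpoint=route).inc()
        request_latency.labels(method=method, endpoint=route,
                               status=status).observe(time.perf_counter() - start)


observe_request("GET", "/api/orders/{id}", lambda: 200)
exposition = generate_latest(registry).decode()  # text served at /metrics
```

Recording in a finally block means failed requests still count toward traffic and latency, which keeps error-rate ratios consistent.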
Fixed Cardinality
# GOOD - Fixed cardinality
labels = ['entity_type'] # 17 known values
# BAD - High cardinality
labels = ['entity_id'] # Unbounded values
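A label set stays fixed only if unexpected values are bucketed before they reach the metric. The sketch below shows one possible guard; the three entity types and the "other" fallback are placeholders, not the platform's actual 17 values.

```python
# Hypothetical allowlist; the real platform has 17 known entity types.
KNOWN_ENTITY_TYPES = frozenset({"order", "customer", "invoice"})
FALLBACK_LABEL = "other"


def entity_type_label(raw: str) -> str:
    """Map an arbitrary value onto the fixed label set, bucketing unknowns."""
    value = raw.strip().lower()
    return value if value in KNOWN_ENTITY_TYPES else FALLBACK_LABEL


labels = [entity_type_label(v) for v in ["Order", "invoice", "a3f9-77c2"]]
print(labels)  # ['order', 'invoice', 'other']
```

With a guard like this, the number of time series per metric is bounded by the allowlist size plus one, whatever callers pass in.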
Consequences
Positive
- Complete Coverage: All metric types supported
- Alerting Ready: Prometheus AlertManager integration
- Vendor Neutral: OTEL supports any backend
- Four Golden Signals: SRE best practices
- Grafana Dashboards: Rich visualization
Negative
- Dual System: Two metric pipelines to maintain
- Configuration Complexity: Multiple exporters
- Cardinality Management: Must control labels
- Storage Costs: Metrics can be expensive
Neutral
- Scrape Interval: 15-30s typical for Prometheus
- Push Interval: 60s for OTEL business metrics
Alternatives Considered
1. Prometheus Only
- Approach: All metrics via Prometheus
- Rejected: Less flexible for business metrics, no push model
2. OpenTelemetry Only
- Approach: All metrics via OTEL
- Rejected: Less mature Prometheus integration
3. StatsD
- Approach: Lightweight metrics forwarding
- Rejected: Less feature-rich, no histograms
Implementation Status
- Core implementation complete
- Tests written and passing
- Documentation updated
- Migration/upgrade path defined
- Monitoring/observability in place
Implementation Details
- Prometheus Metrics: backend/core/metrics/prometheus_metrics.py
- OTEL Metrics: backend/core/telemetry/metrics.py
- Endpoint: /metrics
- Config: PROMETHEUS_MULTIPROC_DIR, OTEL_*
- Docs: docs/TRACING_CONFIGURATION.md
Compliance/Validation
- Automated checks: Metric endpoint health
- Manual review: Dashboard completeness
- Metrics: Meta-metrics on collection health
LLM Council Review
Review Date: 2025-01-16
Confidence Level: High (100%)
Verdict: CONDITIONAL ACCEPTANCE
Quality Metrics
- Consensus Strength Score (CSS): 0.90
- Deliberation Depth Index (DDI): 0.88
Council Feedback Summary
The foundation (Golden Signals, Grafana, fixed cardinality) is sound, but dual metrics collection introduces unnecessary operational risk if implemented as two distinct pipelines.
Key Concerns Identified:
- 2x Operational Toil: Managing two config languages (Prometheus YAML vs OTEL YAML) and two failure modes
- Data Fragmentation: Correlating an infrastructure CPU spike (Prometheus) with an order-volume drop (OTEL) is difficult when the two signals are stored in different formats
- Temporality Trap: OTEL can emit Delta aggregation (depending on SDK and exporter configuration), while Prometheus requires Cumulative; a mismatch causes PromQL alert failures
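To make the temporality concern concrete, the sketch below converts per-interval deltas into a cumulative series and shows what goes wrong when deltas are read as if they were a cumulative counter. The sample values and the naive_increase helper are illustrative assumptions; real PromQL functions such as increase() also handle counter resets and extrapolation.

```python
def deltas_to_cumulative(deltas):
    """Turn per-interval increments into a monotonically increasing running total."""
    total = 0
    series = []
    for delta in deltas:
        if delta < 0:
            raise ValueError("counter deltas must be non-negative")
        total += delta
        series.append(total)
    return series


def naive_increase(samples):
    """Simplified stand-in for a counter increase: last sample minus first."""
    return samples[-1] - samples[0]


deltas = [5, 3, 4]                      # hypothetical per-export increments
cumulative = deltas_to_cumulative(deltas)
print(cumulative)                       # [5, 8, 12]
print(naive_increase(cumulative))       # 7: events counted after the first sample
print(naive_increase(deltas))           # -1: deltas misread as a cumulative counter
```

A negative or erratic increase is what a cumulative-only backend would treat as a counter reset, which is how alert expressions end up firing incorrectly or not at all.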
Required Modifications:
- Unified Collector Pattern:
  [Apps (Push)] --OTLP--> [OTEL Collector] <--Scrape-- [Infra (Pull)]
                                  |
                           (Remote Write)
                                  v
                        [Prometheus Backend]
- Configure OTEL for Cumulative: Emit Cumulative metrics for Prometheus compatibility
- Exemplars for High Cardinality: Attach Trace IDs to metric buckets instead of using entity_id labels
- Route Templates: HTTP metrics must use /api/orders/{id}, not raw paths
- Cardinality Linting: CI/CD pipeline to reject prohibited labels (user_id, uuid)
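The route-template requirement could be met with a small normalization step before the endpoint label is set. The following is a minimal sketch under assumptions: the ROUTES table, the "unmatched" fallback, and the function name are hypothetical, and a real web framework would normally expose the matched route directly.

```python
import re

# Hypothetical route table: (compiled pattern, template used as the label value).
ROUTES = [
    (re.compile(r"^/api/orders/[^/]+$"), "/api/orders/{id}"),
    (re.compile(r"^/api/orders$"), "/api/orders"),
]
UNMATCHED = "unmatched"


def route_template(raw_path: str) -> str:
    """Collapse a raw request path into its route template for use as a label."""
    # Drop the query string and any trailing slash before matching.
    path = raw_path.split("?", 1)[0].rstrip("/") or "/"
    for pattern, template in ROUTES:
        if pattern.match(path):
            return template
    return UNMATCHED  # a single bucket keeps unknown paths from growing the label set


print(route_template("/api/orders/8f14e45f?expand=items"))  # /api/orders/{id}
print(route_template("/admin/debug"))                       # unmatched
```

Mapping unknown paths to one fallback value matters as much as the templates themselves, since scanners and typos otherwise create a new series per path.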
Modifications Applied
- Documented unified OTEL Collector pattern
- Added temporality configuration requirement
- Documented Exemplars usage for debugging
- Added cardinality linting recommendation
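The cardinality linting recommendation might look something like the following check, which scans Python source for metric constructors that declare prohibited label names. It is a sketch, not the pipeline's actual linter: the prohibited set, the constructor names, and the assumption that labels are passed as a literal list or tuple are all illustrative.

```python
import ast

# Assumed policy: label names that must never appear on a metric.
PROHIBITED_LABELS = {"user_id", "uuid", "entity_id"}
METRIC_CONSTRUCTORS = {"Counter", "Gauge", "Histogram", "Summary"}


def find_prohibited_labels(source: str):
    """Return sorted (line, label) pairs for prohibited labels in metric constructor calls."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name not in METRIC_CONSTRUCTORS:
            continue
        # Label names may be the third positional argument or the labelnames keyword.
        candidates = list(node.args[2:3])
        candidates += [kw.value for kw in node.keywords if kw.arg == "labelnames"]
        for arg in candidates:
            if isinstance(arg, (ast.List, ast.Tuple)):
                for element in arg.elts:
                    if isinstance(element, ast.Constant) and element.value in PROHIBITED_LABELS:
                        violations.append((element.lineno, element.value))
    return sorted(violations)


sample = '''
ok = Counter("ops_total", "Operations", ["entity_type"])
bad = Histogram("latency_seconds", "Latency", ["method", "user_id"])
'''
print(find_prohibited_labels(sample))  # [(3, 'user_id')]
```

In a CI job, a wrapper would run this over changed files and exit non-zero when the returned list is not empty. Labels built dynamically at runtime would escape a static check like this one.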
Council Ranking
- claude-opus-4.5: Best Response (unified pattern)
- gpt-5.2: Strong (temporality analysis)
- gemini-3-pro: Good (exemplars)