
ADR-023: Prometheus + OpenTelemetry Metrics

Status

Implemented

Date

2025-01-16 (Retrospective)

Decision Makers

  • SRE Team - Metrics strategy
  • Architecture Team - Observability patterns

Layer

Observability

Related

  • ADR-009: OpenTelemetry Instrumentation
  • ADR-043: ClickStack Integration

Supersedes

None

Depends On

  • ADR-009: OpenTelemetry Instrumentation

Context

The platform needs comprehensive metrics collection:

  1. Infrastructure Metrics: CPU, memory, connections
  2. Application Metrics: Latency, throughput, errors
  3. Business Metrics: Entity counts, operations
  4. SRE Metrics: SLIs, error budgets
  5. Alerting: Prometheus AlertManager integration

Key constraints:

  • Prometheus is the alerting backend
  • OpenTelemetry for distributed tracing
  • Need both pull and push models
  • Fixed cardinality to control costs
  • Grafana for visualization

Decision

We implement dual metrics collection with Prometheus and OpenTelemetry:

Key Design Decisions

  1. Prometheus Pull: Infrastructure and high-frequency metrics
  2. OpenTelemetry Push: Business and custom metrics
  3. Fixed Cardinality: Controlled label sets
  4. Four Golden Signals: Latency, traffic, errors, saturation
  5. SRE Metrics: SLI/SLO tracking built-in

Metrics Architecture

Application
├── Prometheus Endpoint (/metrics)
│       → Prometheus Server → AlertManager
│                           → Grafana
└── OpenTelemetry SDK
        → OTLP Exporter → ClickStack/Collector → Grafana
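
The pull half of this architecture can be sketched with prometheus_client: make_wsgi_app() serves the /metrics endpoint that Prometheus scrapes. db_connections_active is the gauge from the metric table below; call_metrics is a hypothetical harness for exercising the app without a real HTTP server, and the value 7 is illustrative:

```python
from prometheus_client import Gauge, make_wsgi_app

# Gauge from the metric table; the value set here is illustrative.
db_connections_active = Gauge('db_connections_active', 'Active DB connections')
db_connections_active.set(7)

# make_wsgi_app() builds the WSGI app that serves /metrics for scraping.
metrics_app = make_wsgi_app()

def call_metrics(app):
    """Hypothetical harness: invoke the WSGI app directly, return status and body."""
    captured = {}
    def start_response(status, headers):
        captured['status'] = status
    environ = {'REQUEST_METHOD': 'GET', 'PATH_INFO': '/metrics'}
    body = b''.join(app(environ, start_response))
    return captured['status'], body

status, body = call_metrics(metrics_app)
```

In production this app is mounted on the service's HTTP server rather than called directly.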

Metric Types

Metric              Type       Collection  Example
Request latency     Histogram  Prometheus  http_request_duration_seconds
Request count       Counter    Prometheus  http_requests_total
Active connections  Gauge      Prometheus  db_connections_active
Entity operations   Counter    OTEL        ops.entity.created
Error budget        Gauge      OTEL        ops.slo.error_budget_remaining

Four Golden Signals

# Prometheus client metrics covering the Four Golden Signals
from prometheus_client import Counter, Gauge, Histogram

# Latency - request duration
request_latency = Histogram(
    'http_request_duration_seconds',
    'Request duration',
    ['method', 'endpoint', 'status']
)

# Traffic - request rate
request_count = Counter(
    'http_requests_total',
    'Total requests',
    ['method', 'endpoint']
)

# Errors - error rate
error_count = Counter(
    'http_errors_total',
    'Total errors',
    ['method', 'endpoint', 'error_type']
)

# Saturation - resource usage
db_pool_utilization = Gauge(
    'db_pool_utilization_ratio',
    'Database pool utilization'
)
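
Wiring the signals together around a request handler might look like the sketch below. The two metrics are restated so the example is self-contained, and handle_request is a hypothetical wrapper, not the platform's middleware:

```python
import time
from prometheus_client import Counter, Histogram

# Restating two of the signals above so this sketch is self-contained.
request_latency = Histogram(
    'http_request_duration_seconds', 'Request duration',
    ['method', 'endpoint', 'status'])
request_count = Counter(
    'http_requests_total', 'Total requests',
    ['method', 'endpoint'])

def handle_request(method, endpoint, handler):
    """Hypothetical wrapper: record traffic and latency around any handler."""
    request_count.labels(method=method, endpoint=endpoint).inc()
    start = time.perf_counter()
    status = '500'
    try:
        result = handler()
        status = '200'
        return result
    finally:
        request_latency.labels(
            method=method, endpoint=endpoint, status=status
        ).observe(time.perf_counter() - start)

orders = handle_request('GET', '/api/orders', lambda: {'orders': []})
```

In a real service this logic typically lives in framework middleware rather than an explicit wrapper.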

Fixed Cardinality

# GOOD - Fixed cardinality
labels = ['entity_type'] # 17 known values

# BAD - High cardinality
labels = ['entity_id'] # Unbounded values
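
The fixed-cardinality rule can be enforced mechanically. A minimal lint sketch, assuming a prohibited-label list (entity_id comes from this ADR, user_id and uuid from the council review below; the exact list is an assumption):

```python
# Assumed prohibited-label list; extend to match the platform's policy.
PROHIBITED_LABELS = {'entity_id', 'user_id', 'uuid'}

def lint_labels(labels):
    """Return the prohibited (unbounded) label names found in a metric's label set."""
    return sorted(set(labels) & PROHIBITED_LABELS)

lint_labels(['entity_type'])            # fixed cardinality - passes
lint_labels(['entity_id', 'method'])    # unbounded - flagged
```

Running this check in CI against metric declarations rejects high-cardinality labels before they reach production.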

Consequences

Positive

  • Complete Coverage: All metric types supported
  • Alerting Ready: Prometheus AlertManager integration
  • Vendor Neutral: OTEL supports any backend
  • Four Golden Signals: SRE best practices
  • Grafana Dashboards: Rich visualization

Negative

  • Dual System: Two metric pipelines to maintain
  • Configuration Complexity: Multiple exporters
  • Cardinality Management: Must control labels
  • Storage Costs: Metrics can be expensive

Neutral

  • Scrape Interval: 15-30s typical for Prometheus
  • Push Interval: 60s for OTEL business metrics

Alternatives Considered

1. Prometheus Only

  • Approach: All metrics via Prometheus
  • Rejected: Less flexible for business metrics, no push model

2. OpenTelemetry Only

  • Approach: All metrics via OTEL
  • Rejected: Less mature Prometheus integration

3. StatsD

  • Approach: Lightweight metrics forwarding
  • Rejected: Less feature-rich, no histograms

Implementation Status

  • Core implementation complete
  • Tests written and passing
  • Documentation updated
  • Migration/upgrade path defined
  • Monitoring/observability in place

Implementation Details

  • Prometheus Metrics: backend/core/metrics/prometheus_metrics.py
  • OTEL Metrics: backend/core/telemetry/metrics.py
  • Endpoint: /metrics
  • Config: PROMETHEUS_MULTIPROC_DIR, OTEL_*
  • Docs: docs/TRACING_CONFIGURATION.md

Compliance/Validation

  • Automated checks: Metric endpoint health
  • Manual review: Dashboard completeness
  • Metrics: Meta-metrics on collection health

LLM Council Review

Review Date: 2025-01-16
Confidence Level: High (100%)
Verdict: CONDITIONAL ACCEPTANCE

Quality Metrics

  • Consensus Strength Score (CSS): 0.90
  • Deliberation Depth Index (DDI): 0.88

Council Feedback Summary

The foundation (Golden Signals, Grafana, fixed cardinality) is sound, but dual metrics collection introduces unnecessary operational risk if implemented as two fully distinct pipelines.

Key Concerns Identified:

  1. 2x Operational Toil: Managing two configuration languages (Prometheus YAML vs. OTEL YAML) and two sets of failure modes
  2. Data Fragmentation: Correlating an infrastructure CPU spike (Prometheus) with an order-volume drop (OTEL) is difficult when the two pipelines use different formats
  3. Temporality Trap: OTEL defaults to delta aggregation while Prometheus requires cumulative, which causes PromQL alert failures

Required Modifications:

  1. Unified Collector Pattern:

     [Apps (Push)] --OTLP--> [OTEL Collector] <--Scrape-- [Infra (Pull)]
                                     |
                              (Remote Write)
                                     v
                           [Prometheus Backend]
  2. Configure OTEL for Cumulative: Emit Cumulative metrics for Prometheus compatibility
  3. Exemplars for High Cardinality: Attach Trace IDs to metric buckets instead of using entity_id labels
  4. Route Templates: HTTP metrics must use /api/orders/{id} not raw paths
  5. Cardinality Linting: CI/CD pipeline to reject prohibited labels (user_id, uuid)
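
Modification 4 can be applied before a path ever becomes a label value. A minimal sketch, assuming numeric and UUID id segments; the platform's real router templates may differ:

```python
import re

# Matches a path segment that is entirely numeric or a UUID (assumed id formats).
_ID_SEGMENT = re.compile(
    r'/(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}'
    r'-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})(?=/|$)')

def route_template(path: str) -> str:
    """Replace numeric and UUID path segments with {id} before labeling."""
    return _ID_SEGMENT.sub('/{id}', path)

route_template('/api/orders/12345')   # → '/api/orders/{id}'
```

When the web framework exposes its matched route pattern directly, prefer that over regex normalization.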

Modifications Applied

  1. Documented unified OTEL Collector pattern
  2. Added temporality configuration requirement
  3. Documented Exemplars usage for debugging
  4. Added cardinality linting recommendation

Council Ranking

  • claude-opus-4.5: Best Response (unified pattern)
  • gpt-5.2: Strong (temporality analysis)
  • gemini-3-pro: Good (exemplars)

ADR-023 | Observability Layer | Implemented