
ADR-009: OpenTelemetry Instrumentation

Status

Implemented

Date

2025-01-16 (Retrospective)

Decision Makers

  • SRE Team - Observability requirements
  • Architecture Team - Integration patterns

Layer

Observability

Related ADRs

  • ADR-023: Prometheus + OpenTelemetry Metrics
  • ADR-024: Structured Logging with structlog
  • ADR-043: ClickStack Integration

Supersedes

None

Depends On

  • ADR-008: FastAPI with Pydantic (instrumented application)

Context

The SRE Operations Platform requires comprehensive observability:

  1. Distributed Tracing: Follow requests across services
  2. Metrics Collection: Performance and business metrics
  3. Log Correlation: Connect logs to traces
  4. Vendor Neutrality: Avoid lock-in to specific backends
  5. Standards Compliance: Industry-standard telemetry

Key constraints:

  • Must work with multiple backends (Jaeger, ClickStack, etc.)
  • Need automatic instrumentation for common libraries
  • Require minimal code changes for adoption
  • Must support both backend and frontend
  • Performance overhead must stay production-grade (<5%)

Decision

We adopt OpenTelemetry as the primary instrumentation framework:

Key Design Decisions

  1. OpenTelemetry SDK: Vendor-neutral telemetry collection
  2. Automatic Instrumentation: FastAPI, SQLAlchemy, httpx auto-instrumented
  3. OTLP Export: Standard protocol for all telemetry types
  4. W3C Trace Context Propagation: Distributed trace context across services (composite propagator with B3 fallback for legacy interop)
  5. Custom Spans: Business-critical operations manually traced

Backend Configuration

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Configure tracer provider with service identity
provider = TracerProvider(resource=Resource.create({
    "service.name": "ops-backend",
    "deployment.environment": settings.environment,
}))

# Add OTLP exporter; spans are sent in batches off the request path
otlp_exporter = OTLPSpanExporter(
    endpoint=settings.otlp_endpoint,
)
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)

Frontend Configuration

import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { FetchInstrumentation } from '@opentelemetry/instrumentation-fetch';
import { DocumentLoadInstrumentation } from '@opentelemetry/instrumentation-document-load';

const provider = new WebTracerProvider({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'ops-frontend',
  }),
});
provider.register();

// Auto-instrument fetch calls and the initial document load
registerInstrumentations({
  instrumentations: [
    new FetchInstrumentation(),
    new DocumentLoadInstrumentation(),
  ],
});

Trace Span Types

Span Type | Purpose           | Example
HTTP      | Request/response  | GET /api/v1/requirements
DB        | Database queries  | SELECT requirements
Custom    | Business logic    | generate_embeddings
External  | Third-party calls | OpenAI API

Consequences

Positive

  • Vendor Neutral: Switch backends without code changes
  • Full Visibility: End-to-end request tracing
  • Log Correlation: Trace IDs in all log messages
  • Standards Based: W3C Trace Context, OTLP
  • Auto Instrumentation: Minimal code for basic coverage
  • Performance Insights: Identify slow operations
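The log-correlation benefit above depends on rendering the active span's 128-bit trace id in the same 32-hex-character form the tracing backend uses; a minimal formatting helper (pure Python, independent of any logging library) might be:

```python
def format_trace_id(trace_id: int) -> str:
    """Render a 128-bit OpenTelemetry trace id as 32 lowercase hex chars,
    matching the W3C traceparent representation."""
    return format(trace_id, "032x")


def format_span_id(span_id: int) -> str:
    """Render a 64-bit span id as 16 lowercase hex chars."""
    return format(span_id, "016x")


# Example trace id, rendered with its leading zero preserved
print(format_trace_id(0x0AF7651916CD43DD8448EB211C80319C))
# -> 0af7651916cd43dd8448eb211c80319c
```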

Negative

  • Learning Curve: OpenTelemetry concepts require understanding
  • Performance Overhead: ~2-3% for tracing
  • Configuration Complexity: Multiple exporters and processors
  • Data Volume: High trace volume needs sampling

Neutral

  • Export Flexibility: Can send to multiple backends
  • SDK Maturity: Python SDK stable, JavaScript improving

Alternatives Considered

1. Jaeger Client Direct

  • Approach: Jaeger-specific instrumentation
  • Rejected: Vendor lock-in, less ecosystem support

2. Datadog APM

  • Approach: Proprietary APM solution
  • Rejected: Vendor lock-in, licensing costs

3. Custom Instrumentation

  • Approach: Build tracing from scratch
  • Rejected: Reinventing standard solutions

Implementation Status

  • Core implementation complete
  • Tests written and passing
  • Documentation updated
  • Migration/upgrade path defined
  • Monitoring/observability in place

Implementation Details

  • Backend Telemetry: backend/core/telemetry/
  • Trace Setup: backend/core/telemetry/tracing.py
  • Metrics Setup: backend/core/telemetry/metrics.py
  • Frontend Tracing: frontend/src/telemetry/
  • Config: OTLP_ENDPOINT, OTEL_* environment variables
  • Docs: docs/TRACING_CONFIGURATION.md

Compliance/Validation

  • Automated checks: Trace coverage verified in tests
  • Manual review: Sampling rates reviewed for production
  • Metrics: Trace export success rate, span count

LLM Council Review

Review Date: 2025-01-16
Confidence Level: High (100%)
Verdict: APPROVED WITH CRITICAL AMENDMENTS

Quality Metrics

  • Consensus Strength Score (CSS): 0.92
  • Deliberation Depth Index (DDI): 0.88

Council Feedback Summary

The council approved OpenTelemetry as the correct strategic choice but identified two critical risks: legacy propagation standard (B3) and missing sampling strategy.

Key Concerns Identified:

  1. B3 Propagation is Anti-Pattern: Legacy Zipkin-era standard; W3C Trace Context is the modern standard
  2. No Sampling Strategy: 100% capture (AlwaysOn) is financially and computationally dangerous
  3. Missing Instrumentation: Celery, Redis, boto3, Kubernetes clients not covered
  4. Frontend Noise Risk: Frontend tracing can be extremely noisy (every click/scroll)

Required Modifications:

  1. Replace B3 with W3C Trace Context:
    • Primary: W3C traceparent
    • Fallback: Composite propagator accepting both W3C and B3 for legacy interop
  2. Define Sampling Strategy:
    • Head Sampling: ParentBasedTraceIdRatioBased (e.g., 10% healthy, 100% errors)
    • Tail Sampling: Configure OTel Collector to keep traces >500ms or with errors
    • Enforce BatchSpanProcessor with tuned queue sizes
  3. Expand Instrumentation:
    • Background workers: Celery, RQ, Kafka
    • Infrastructure: boto3, Google Cloud, Kubernetes clients
    • Caching: Redis hit/miss visibility
  4. ML Workloads: Custom spans must capture gen_ai.token.input_count, model.name
  5. Architecture: Require OpenTelemetry Collector as intermediary gateway

Modifications Applied

  1. Updated propagation standard to W3C Trace Context
  2. Documented hybrid sampling strategy (head + tail)
  3. Added Celery, Redis, infrastructure SDK instrumentation requirements
  4. Documented OTel Collector deployment pattern
  5. Added frontend tracing noise reduction guidelines

Council Ranking

  • gpt-5.2: Best Response (sampling strategy)
  • gemini-3-pro: Strong (propagation analysis)
  • grok-4.1: Good (instrumentation gaps)

Operational Guidelines (APPROVED_WITH_MODS)

Sampling Strategy Documentation

Head Sampling (Application Level):

# backend/core/telemetry/sampling.py
from opentelemetry.sdk.trace.sampling import (
    ALWAYS_ON,
    Decision,
    ParentBasedTraceIdRatio,
    Sampler,
    SamplingResult,
    TraceIdRatioBased,
)

def get_sampler(environment: str) -> Sampler:
    """Get environment-appropriate sampler."""
    if environment == "development":
        return ALWAYS_ON  # 100% for debugging
    elif environment == "staging":
        return ParentBasedTraceIdRatio(0.5)  # 50%
    else:  # production
        return ParentBasedTraceIdRatio(0.1)  # 10% baseline

# Configure with custom logic for important traces
class AdaptiveSampler(Sampler):
    """Sample 100% of errors, slow requests, and critical paths."""

    def __init__(self, base_ratio: float = 0.1):
        self.base_ratio = base_ratio
        self.base_sampler = TraceIdRatioBased(base_ratio)

    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        attributes = attributes or {}

        # Always sample errors
        if attributes.get("error", False):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)

        # Always sample slow operations
        if attributes.get("http.duration_ms", 0) > 500:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)

        # Always sample critical paths
        critical_paths = ["/api/v1/auth", "/api/v1/slos", "/health"]
        if any(path in name for path in critical_paths):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)

        # Use base ratio for everything else
        return self.base_sampler.should_sample(
            parent_context, trace_id, name, kind, attributes, links, trace_state
        )

    def get_description(self) -> str:
        return f"AdaptiveSampler{{base_ratio={self.base_ratio}}}"
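The ratio half of head sampling can be illustrated without the SDK. This sketch mirrors, approximately, how a trace-id-ratio sampler derives a deterministic yes/no from the trace id itself, which is what lets every service that sees the same trace agree on the decision:

```python
TRACE_ID_LIMIT = (1 << 64) - 1  # the sampler looks at the low 64 bits only


def ratio_sampled(trace_id: int, ratio: float) -> bool:
    """Deterministic head-sampling decision: compare the low 64 bits of the
    trace id against a bound derived from the ratio. Same trace id + same
    ratio always yields the same answer, on every service."""
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound


# A ratio of 1.0 keeps every trace; 0.0 keeps none
print(ratio_sampled(0x0AF7651916CD43DD8448EB211C80319C, 1.0))  # -> True
print(ratio_sampled(0x0AF7651916CD43DD8448EB211C80319C, 0.0))  # -> False
```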

Tail Sampling (Collector Level):

# otel-collector-config.yaml
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      # Keep all error traces
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]

      # Keep slow traces (>500ms)
      - name: latency-policy
        type: latency
        latency:
          threshold_ms: 500

      # Keep traces from critical services
      - name: critical-services
        type: string_attribute
        string_attribute:
          key: service.name
          values: [ops-auth, ops-slo-alerting]

      # Probabilistic sampling for remainder
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

Sampling Budget:

Environment | Head Sample Rate | Tail Sample Rate   | Est. Traces/Day
Development | 100%             | N/A                | ~10K
Staging     | 50%              | Keep errors + slow | ~100K
Production  | 10% + smart      | Keep errors + slow | ~500K
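The budget figures above follow from simple arithmetic: requests/day times the head rate, before tail policies add back errors and slow traces. The request volumes here are illustrative assumptions, not measured figures:

```python
def estimated_traces_per_day(requests_per_day: int, head_rate: float) -> int:
    """Head-sampled trace volume, before tail-sampling adjustments."""
    return int(requests_per_day * head_rate)


# Illustrative only: the table's ~500K production figure corresponds to
# roughly 5M requests/day at the 10% head rate
print(estimated_traces_per_day(5_000_000, 0.10))  # -> 500000
```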

Trace Context Propagation

W3C Trace Context (Primary):

# backend/core/telemetry/propagation.py
from opentelemetry.propagate import set_global_textmap
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
from opentelemetry.propagators.b3 import B3MultiFormat

# Configure composite propagator (W3C primary, B3 fallback)
propagator = CompositePropagator([
    TraceContextTextMapPropagator(),  # W3C traceparent (primary)
    B3MultiFormat(),                  # Legacy B3 (fallback for older services)
])
set_global_textmap(propagator)

HTTP Header Propagation:

# W3C Trace Context headers (W3C Recommendation)
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE,rojo=00f067aa0ba902b7

# traceparent format: version-trace_id-span_id-trace_flags
#   version:     00 (current version)
#   trace_id:    32 hex chars (16 bytes)
#   span_id:     16 hex chars (8 bytes)
#   trace_flags: 01 = sampled, 00 = not sampled
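The field widths above can be checked mechanically; a minimal pure-Python parser (a sketch for illustration, not the SDK's implementation) that splits a traceparent value into its four dash-separated fields:

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields, validating
    the field widths: 2 + 32 + 16 + 2 hex characters."""
    version, trace_id, span_id, flags = header.split("-")
    assert len(version) == 2 and len(trace_id) == 32
    assert len(span_id) == 16 and len(flags) == 2
    return {
        "version": version,
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": flags == "01",
    }


parsed = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
print(parsed["trace_id"])  # -> 0af7651916cd43dd8448eb211c80319c
print(parsed["sampled"])   # -> True
```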

Cross-Service Propagation:

# FastAPI middleware for trace extraction
from fastapi import Request
from opentelemetry import trace
from opentelemetry.propagate import extract
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer(__name__)

@app.middleware("http")
async def trace_middleware(request: Request, call_next):
    # Extract trace context from incoming headers
    context = extract(request.headers)

    # Create span with extracted context
    with tracer.start_as_current_span(
        f"{request.method} {request.url.path}",
        context=context,
        kind=SpanKind.SERVER,
    ) as span:
        # Add request attributes
        span.set_attribute("http.method", request.method)
        span.set_attribute("http.url", str(request.url))

        response = await call_next(request)

        span.set_attribute("http.status_code", response.status_code)
        return response

Frontend-to-Backend Propagation:

// frontend/src/telemetry/propagation.ts
import { W3CTraceContextPropagator } from '@opentelemetry/core';
import { context, propagation } from '@opentelemetry/api';

// Register the W3C propagator globally so inject() uses it
propagation.setGlobalPropagator(new W3CTraceContextPropagator());

// Inject trace context into fetch headers
async function fetchWithTracing(url: string, options: RequestInit = {}) {
  const headers = new Headers(options.headers);

  // Inject current context into headers
  propagation.inject(context.active(), headers, {
    set: (carrier, key, value) => carrier.set(key, value),
  });

  return fetch(url, { ...options, headers });
}

Propagation Validation:

# Test trace context propagation (tracer and test client assumed as fixtures)
def test_trace_propagation():
    """Verify trace context flows through service boundaries."""
    # Start a trace
    with tracer.start_as_current_span("parent") as parent:
        trace_id = parent.get_span_context().trace_id

        # Make downstream request inside the active span
        response = client.get("/api/v1/requirements")

        # Verify same trace_id in response headers
        assert response.headers.get("X-Trace-Id") == format(trace_id, "032x")
