ADR-009: OpenTelemetry Instrumentation
Status
Implemented
Date
2025-01-16 (Retrospective)
Decision Makers
- SRE Team - Observability requirements
- Architecture Team - Integration patterns
Layer
Observability
Related ADRs
- ADR-023: Prometheus + OpenTelemetry Metrics
- ADR-024: Structured Logging with structlog
- ADR-043: ClickStack Integration
Supersedes
None
Depends On
- ADR-008: FastAPI with Pydantic (instrumented application)
Context
The SRE Operations Platform requires comprehensive observability:
- Distributed Tracing: Follow requests across services
- Metrics Collection: Performance and business metrics
- Log Correlation: Connect logs to traces
- Vendor Neutrality: Avoid lock-in to specific backends
- Standards Compliance: Industry-standard telemetry
Key constraints:
- Must work with multiple backends (Jaeger, ClickStack, etc.)
- Need automatic instrumentation for common libraries
- Require minimal code changes for adoption
- Must support both backend and frontend
- Performance overhead must stay below 5% in production
Decision
We adopt OpenTelemetry as the primary instrumentation framework:
Key Design Decisions
- OpenTelemetry SDK: Vendor-neutral telemetry collection
- Automatic Instrumentation: FastAPI, SQLAlchemy, httpx auto-instrumented (see the sketch after this list)
- OTLP Export: Standard protocol for all telemetry types
- W3C Trace Context Propagation: Distributed trace context across services, with B3 accepted as a legacy fallback (per council review below)
- Custom Spans: Business-critical operations manually traced
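A minimal sketch of enabling the automatic instrumentation listed above, assuming the corresponding `opentelemetry-instrumentation-*` packages are installed; `app` and `engine` stand in for the application's FastAPI app and SQLAlchemy engine:
```python
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

FastAPIInstrumentor.instrument_app(app)             # server span per request
SQLAlchemyInstrumentor().instrument(engine=engine)  # span per DB query
HTTPXClientInstrumentor().instrument()              # span per outbound HTTP call
```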
Backend Configuration
```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Configure tracer; `settings` is the application's Pydantic settings (ADR-008)
provider = TracerProvider(resource=Resource.create({
    "service.name": "ops-backend",
    "deployment.environment": settings.environment,
}))

# Add OTLP exporter
otlp_exporter = OTLPSpanExporter(
    endpoint=settings.otlp_endpoint,
)
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)
```
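The council review below requires a `BatchSpanProcessor` with tuned queue sizes. A hedged sketch of registering the exporter with explicit limits instead of the defaults used above; the specific values are illustrative assumptions:
```python
provider.add_span_processor(
    BatchSpanProcessor(
        otlp_exporter,
        max_queue_size=4096,         # default is 2048; raised for bursty traffic
        max_export_batch_size=512,   # spans per OTLP export request
        schedule_delay_millis=5000,  # ms between scheduled exports
    )
)
```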
Frontend Configuration
```typescript
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { FetchInstrumentation } from '@opentelemetry/instrumentation-fetch';
import { DocumentLoadInstrumentation } from '@opentelemetry/instrumentation-document-load';

const provider = new WebTracerProvider({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'ops-frontend',
  }),
});
provider.register(); // set as the global tracer provider (exporter setup omitted)

// Auto-instrument fetch calls and page loads
registerInstrumentations({
  instrumentations: [
    new FetchInstrumentation(),
    new DocumentLoadInstrumentation(),
  ],
});
```
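Per the frontend-noise concern raised in the council review below, a hedged sketch of narrowing what the fetch instrumentation records; the URL patterns and origin are illustrative assumptions:
```typescript
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { FetchInstrumentation } from '@opentelemetry/instrumentation-fetch';

registerInstrumentations({
  instrumentations: [
    new FetchInstrumentation({
      // Skip telemetry export and health probes to avoid self-tracing noise
      ignoreUrls: [/\/v1\/traces/, /\/health/],
      // Only propagate trace headers to our own backend origin (illustrative)
      propagateTraceHeaderCorsUrls: [/^https:\/\/ops\.example\.com/],
    }),
  ],
});
```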
Trace Span Types
| Span Type | Purpose | Example |
|---|---|---|
| HTTP | Request/response | GET /api/v1/requirements |
| DB | Database queries | SELECT requirements |
| Custom | Business logic | generate_embeddings |
| External | Third-party calls | OpenAI API |
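As an illustration of the Custom row, a hedged sketch of a manually traced business operation. The function body and attribute values are assumptions; `gen_ai.token.input_count` and `model.name` are the attributes required by the council review below:
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def generate_embeddings(texts: list[str]) -> list[list[float]]:
    """Embed texts, recording a custom span for the ML workload."""
    with tracer.start_as_current_span("generate_embeddings") as span:
        span.set_attribute("model.name", "text-embedding-3-small")  # illustrative
        span.set_attribute("gen_ai.token.input_count",
                           sum(len(t.split()) for t in texts))
        return embed(texts)  # hypothetical embedding call
```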
Consequences
Positive
- Vendor Neutral: Switch backends without code changes
- Full Visibility: End-to-end request tracing
- Log Correlation: Trace IDs in all log messages
- Standards Based: W3C Trace Context, OTLP
- Auto Instrumentation: Minimal code for basic coverage
- Performance Insights: Identify slow operations
Negative
- Learning Curve: OpenTelemetry concepts require understanding
- Performance Overhead: ~2-3% for tracing
- Configuration Complexity: Multiple exporters and processors
- Data Volume: High trace volume needs sampling
Neutral
- Export Flexibility: Can send to multiple backends
- SDK Maturity: Python SDK stable, JavaScript improving
Alternatives Considered
1. Jaeger Client Direct
- Approach: Jaeger-specific instrumentation
- Rejected: Vendor lock-in, less ecosystem support
2. Datadog APM
- Approach: Proprietary APM solution
- Rejected: Vendor lock-in, licensing costs
3. Custom Instrumentation
- Approach: Build tracing from scratch
- Rejected: Reinventing standard solutions
Implementation Status
- Core implementation complete
- Tests written and passing
- Documentation updated
- Migration/upgrade path defined
- Monitoring/observability in place
Implementation Details
- Backend Telemetry: backend/core/telemetry/
- Trace Setup: backend/core/telemetry/tracing.py
- Metrics Setup: backend/core/telemetry/metrics.py
- Frontend Tracing: frontend/src/telemetry/
- Config: OTLP_ENDPOINT and OTEL_* environment variables
- Docs: docs/TRACING_CONFIGURATION.md
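An illustrative environment configuration for the variables above. OTLP_ENDPOINT is the platform's own variable; the OTEL_* names are standard SDK variables, and the values shown are assumptions:
```bash
OTLP_ENDPOINT=http://otel-collector:4317     # collector gateway (illustrative)
OTEL_SERVICE_NAME=ops-backend
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1                  # 10% production baseline
```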
Compliance/Validation
- Automated checks: Trace coverage verified in tests
- Manual review: Sampling rates reviewed for production
- Metrics: Trace export success rate, span count
LLM Council Review
Review Date: 2025-01-16
Confidence Level: High (100%)
Verdict: APPROVED WITH CRITICAL AMENDMENTS
Quality Metrics
- Consensus Strength Score (CSS): 0.92
- Deliberation Depth Index (DDI): 0.88
Council Feedback Summary
The council approved OpenTelemetry as the correct strategic choice but identified two critical risks: reliance on the legacy B3 propagation standard and the absence of a sampling strategy.
Key Concerns Identified:
- B3 Propagation is Anti-Pattern: Legacy Zipkin-era standard; W3C Trace Context is the modern standard
- No Sampling Strategy: 100% capture (AlwaysOn) is financially and computationally dangerous
- Missing Instrumentation: Celery, Redis, boto3, Kubernetes clients not covered
- Frontend Noise Risk: Frontend tracing can be extremely noisy (every click/scroll)
Required Modifications:
- Replace B3 with W3C Trace Context:
  - Primary: W3C `traceparent`
  - Fallback: Composite propagator accepting both W3C and B3 for legacy interop
- Define Sampling Strategy:
  - Head Sampling: `ParentBasedTraceIdRatio` (e.g., 10% healthy, 100% errors)
  - Tail Sampling: Configure OTel Collector to keep traces >500ms or with errors
  - Enforce `BatchSpanProcessor` with tuned queue sizes
- Expand Instrumentation (see the sketch after this list):
  - Background workers: Celery, RQ, Kafka
  - Infrastructure: boto3, Google Cloud, Kubernetes clients
  - Caching: Redis hit/miss visibility
  - ML Workloads: Custom spans must capture `gen_ai.token.input_count`, `model.name`
- Architecture: Require OpenTelemetry Collector as intermediary gateway
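A hedged sketch of wiring the expanded instrumentation, assuming the corresponding contrib packages (`opentelemetry-instrumentation-celery`, `-redis`, `-botocore`) are installed; boto3 is covered by instrumenting botocore underneath it:
```python
from opentelemetry.instrumentation.botocore import BotocoreInstrumentor
from opentelemetry.instrumentation.celery import CeleryInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor

CeleryInstrumentor().instrument()    # spans around task publish/execute
RedisInstrumentor().instrument()     # command spans give cache hit/miss visibility
BotocoreInstrumentor().instrument()  # spans for AWS SDK (boto3) calls
```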
Modifications Applied
- Updated propagation standard to W3C Trace Context
- Documented hybrid sampling strategy (head + tail)
- Added Celery, Redis, infrastructure SDK instrumentation requirements
- Documented OTel Collector deployment pattern
- Added frontend tracing noise reduction guidelines
Council Ranking
- gpt-5.2: Best Response (sampling strategy)
- gemini-3-pro: Strong (propagation analysis)
- grok-4.1: Good (instrumentation gaps)
Operational Guidelines (APPROVED_WITH_MODS)
Sampling Strategy Documentation
Head Sampling (Application Level):
```python
# backend/core/telemetry/sampling.py
from opentelemetry.sdk.trace.sampling import (
    ALWAYS_ON,
    Decision,
    ParentBasedTraceIdRatio,
    Sampler,
    SamplingResult,
    TraceIdRatioBased,
)

def get_sampler(environment: str) -> Sampler:
    """Get environment-appropriate sampler."""
    if environment == "development":
        return ALWAYS_ON  # 100% for debugging
    elif environment == "staging":
        return ParentBasedTraceIdRatio(0.5)  # 50%
    else:  # production
        return ParentBasedTraceIdRatio(0.1)  # 10% baseline

# Configure with custom logic for important traces
class AdaptiveSampler(Sampler):
    """Sample 100% of errors, slow requests, and critical paths."""

    def __init__(self, base_ratio: float = 0.1):
        self.base_ratio = base_ratio
        self.base_sampler = TraceIdRatioBased(base_ratio)

    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        attributes = attributes or {}
        # Always sample errors
        if attributes.get("error", False):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        # Always sample slow operations
        if attributes.get("http.duration_ms", 0) > 500:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        # Always sample critical paths
        critical_paths = ["/api/v1/auth", "/api/v1/slos", "/health"]
        if any(path in name for path in critical_paths):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        # Use base ratio for everything else
        return self.base_sampler.should_sample(
            parent_context, trace_id, name, kind, attributes, links, trace_state
        )

    def get_description(self) -> str:
        return f"AdaptiveSampler(base_ratio={self.base_ratio})"
```
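A hypothetical wiring of the adaptive sampler into the tracer provider, mirroring the Backend Configuration above:
```python
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

provider = TracerProvider(
    sampler=AdaptiveSampler(base_ratio=0.1),
    resource=Resource.create({"service.name": "ops-backend"}),
)
```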
Tail Sampling (Collector Level):
```yaml
# otel-collector-config.yaml
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      # Keep all error traces
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Keep slow traces (>500ms)
      - name: latency-policy
        type: latency
        latency:
          threshold_ms: 500
      # Keep traces from critical services
      - name: critical-services
        type: string_attribute
        string_attribute:
          key: service.name
          values: [ops-auth, ops-slo-alerting]
      # Probabilistic sampling for remainder
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```
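For the Collector-as-gateway pattern the council required, a hedged sketch of the surrounding pipeline; the ClickStack exporter name and endpoint are illustrative assumptions (see ADR-043):
```yaml
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}

exporters:
  otlp/clickstack:
    endpoint: clickstack:4317  # illustrative backend endpoint

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp/clickstack]
```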
Sampling Budget:
| Environment | Head Sample Rate | Tail Sample Rate | Est. Traces/Day |
|---|---|---|---|
| Development | 100% | N/A | ~10K |
| Staging | 50% | Keep errors + slow | ~100K |
| Production | 10% + smart | Keep errors + slow | ~500K |
Trace Context Propagation
W3C Trace Context (Primary):
```python
# backend/core/telemetry/propagation.py
from opentelemetry.propagate import set_global_textmap
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
from opentelemetry.propagators.b3 import B3MultiFormat

# Configure composite propagator (W3C primary, B3 fallback)
propagator = CompositePropagator([
    TraceContextTextMapPropagator(),  # W3C traceparent (primary)
    B3MultiFormat(),                  # Legacy B3 (fallback for older services)
])
set_global_textmap(propagator)
```
HTTP Header Propagation:
```text
# W3C Trace Context headers (W3C Recommendation)
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE,rojo=00f067aa0ba902b7

# Header format:
#   traceparent: version-trace_id-span_id-trace_flags
#   version:     00 (current version)
#   trace_id:    32 hex chars (16 bytes)
#   span_id:     16 hex chars (8 bytes)
#   trace_flags: 01 = sampled, 00 = not sampled
```
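As a worked illustration of the format above, a small Python check that decodes the example header:
```python
header = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
version, trace_id, span_id, trace_flags = header.split("-")

assert len(trace_id) == 32 and len(span_id) == 16
sampled = int(trace_flags, 16) & 0x01 == 1  # True: this trace is sampled
```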
Cross-Service Propagation:
```python
# FastAPI middleware for trace extraction
from fastapi import Request
from opentelemetry import trace
from opentelemetry.propagate import extract
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer(__name__)

@app.middleware("http")
async def trace_middleware(request: Request, call_next):
    # Extract trace context from incoming headers
    context = extract(request.headers)
    # Create span with extracted context
    with tracer.start_as_current_span(
        f"{request.method} {request.url.path}",
        context=context,
        kind=SpanKind.SERVER,
    ) as span:
        # Add request attributes
        span.set_attribute("http.method", request.method)
        span.set_attribute("http.url", str(request.url))
        response = await call_next(request)
        span.set_attribute("http.status_code", response.status_code)
        return response
```
Frontend-to-Backend Propagation:
```typescript
// frontend/src/telemetry/propagation.ts
import { W3CTraceContextPropagator } from '@opentelemetry/core';
import { context, propagation } from '@opentelemetry/api';

// Register W3C traceparent as the global propagator
propagation.setGlobalPropagator(new W3CTraceContextPropagator());

// Inject trace context into fetch headers
async function fetchWithTracing(url: string, options: RequestInit = {}) {
  const headers = new Headers(options.headers);
  // Inject current context into headers
  propagation.inject(context.active(), headers, {
    set: (carrier, key, value) => carrier.set(key, value),
  });
  return fetch(url, { ...options, headers });
}
```
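Callers then use the wrapper as a drop-in replacement for `fetch`; the endpoint below is the one from the span-type table above:
```typescript
const response = await fetchWithTracing('/api/v1/requirements');
```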
Propagation Validation:
```python
# Test trace context propagation; assumes `tracer` and a FastAPI
# TestClient `client` are provided by the test fixtures
def test_trace_propagation():
    """Verify trace context flows through service boundaries."""
    # Start a trace
    with tracer.start_as_current_span("parent") as parent:
        trace_id = parent.get_span_context().trace_id
        # Make downstream request
        response = client.get("/api/v1/requirements")
        # Verify the same trace_id in response headers (the app echoes it in
        # X-Trace-Id; trace_id is an int, formatted as 32 hex chars)
        assert response.headers.get("X-Trace-Id") == format(trace_id, "032x")
```
References
- OpenTelemetry Documentation
- OTLP Specification
- W3C Trace Context
- OpenTelemetry Sampling
- Tail-Based Sampling
ADR-009 | Observability Layer | Implemented | APPROVED_WITH_MODS | Completed