
ADR-009: OpenTelemetry Instrumentation

Status

Implemented

Date

2025-01-16 (Retrospective)

Decision Makers

  • SRE Team - Observability requirements
  • Architecture Team - Integration patterns

Layer

Observability

Related ADRs

  • ADR-023: Prometheus + OpenTelemetry Metrics
  • ADR-024: Structured Logging with structlog
  • ADR-043: ClickStack Integration

Supersedes

None

Depends On

  • ADR-008: FastAPI with Pydantic (instrumented application)

Context

The SRE Operations Platform requires comprehensive observability:

  1. Distributed Tracing: Follow requests across services
  2. Metrics Collection: Performance and business metrics
  3. Log Correlation: Connect logs to traces
  4. Vendor Neutrality: Avoid lock-in to specific backends
  5. Standards Compliance: Industry-standard telemetry

Key constraints:

  • Must work with multiple backends (Jaeger, ClickStack, etc.)
  • Need automatic instrumentation for common libraries
  • Require minimal code changes for adoption
  • Must support both backend and frontend
  • Performance overhead must stay production-grade (<5%)

Decision

We adopt OpenTelemetry as the primary instrumentation framework:

Key Design Decisions

  1. OpenTelemetry SDK: Vendor-neutral telemetry collection
  2. Automatic Instrumentation: FastAPI, SQLAlchemy, httpx auto-instrumented
  3. OTLP Export: Standard protocol for all telemetry types
  4. W3C Trace Context Propagation: Distributed trace context across services (composite propagator with B3 fallback for legacy interop)
  5. Custom Spans: Business-critical operations manually traced

Backend Configuration

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Configure tracer provider with service identity
provider = TracerProvider(resource=Resource.create({
    "service.name": "ops-backend",
    "deployment.environment": settings.environment,
}))

# Add OTLP exporter; spans are sent in batches off the request path
otlp_exporter = OTLPSpanExporter(
    endpoint=settings.otlp_endpoint,
)
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)

Frontend Configuration

import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { FetchInstrumentation } from '@opentelemetry/instrumentation-fetch';
import { DocumentLoadInstrumentation } from '@opentelemetry/instrumentation-document-load';

const provider = new WebTracerProvider({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'ops-frontend',
  }),
});
provider.register();

// Auto-instrument fetch calls and the initial document load
registerInstrumentations({
  instrumentations: [
    new FetchInstrumentation(),
    new DocumentLoadInstrumentation(),
  ],
});

Trace Span Types

Span Type | Purpose           | Example
HTTP      | Request/response  | GET /api/v1/requirements
DB        | Database queries  | SELECT requirements
Custom    | Business logic    | generate_embeddings
External  | Third-party calls | OpenAI API

Consequences

Positive

  • Vendor Neutral: Switch backends without code changes
  • Full Visibility: End-to-end request tracing
  • Log Correlation: Trace IDs in all log messages
  • Standards Based: W3C Trace Context, OTLP
  • Auto Instrumentation: Minimal code for basic coverage
  • Performance Insights: Identify slow operations
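The log-correlation benefit above depends on rendering the active span's 128-bit trace id in the same 32-hex-character form the tracing backend uses; a minimal formatting helper (pure Python, independent of any logging library) might be:

```python
def format_trace_id(trace_id: int) -> str:
    """Render a 128-bit OpenTelemetry trace id as 32 lowercase hex chars,
    matching the W3C traceparent representation."""
    return format(trace_id, "032x")


def format_span_id(span_id: int) -> str:
    """Render a 64-bit span id as 16 lowercase hex chars."""
    return format(span_id, "016x")


# Example trace id, rendered with its leading zero preserved
print(format_trace_id(0x0AF7651916CD43DD8448EB211C80319C))
# -> 0af7651916cd43dd8448eb211c80319c
```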

Negative

  • Learning Curve: OpenTelemetry concepts require understanding
  • Performance Overhead: ~2-3% for tracing
  • Configuration Complexity: Multiple exporters and processors
  • Data Volume: High trace volume needs sampling

Neutral

  • Export Flexibility: Can send to multiple backends
  • SDK Maturity: Python SDK stable, JavaScript improving

Alternatives Considered

1. Jaeger Client Direct

  • Approach: Jaeger-specific instrumentation
  • Rejected: Vendor lock-in, less ecosystem support

2. Datadog APM

  • Approach: Proprietary APM solution
  • Rejected: Vendor lock-in, licensing costs

3. Custom Instrumentation

  • Approach: Build tracing from scratch
  • Rejected: Reinventing standard solutions

Implementation Status

  • Core implementation complete
  • Tests written and passing
  • Documentation updated
  • Migration/upgrade path defined
  • Monitoring/observability in place

Implementation Details

  • Backend Telemetry: backend/core/telemetry/
  • Trace Setup: backend/core/telemetry/tracing.py
  • Metrics Setup: backend/core/telemetry/metrics.py
  • Frontend Tracing: frontend/src/telemetry/
  • Config: OTLP_ENDPOINT, OTEL_* environment variables
  • Docs: docs/TRACING_CONFIGURATION.md

Compliance/Validation

  • Automated checks: Trace coverage verified in tests
  • Manual review: Sampling rates reviewed for production
  • Metrics: Trace export success rate, span count

LLM Council Review

Review Date: 2025-01-16
Confidence Level: High (100%)
Verdict: APPROVED WITH CRITICAL AMENDMENTS

Quality Metrics

  • Consensus Strength Score (CSS): 0.92
  • Deliberation Depth Index (DDI): 0.88

Council Feedback Summary

The council approved OpenTelemetry as the correct strategic choice but identified two critical risks: legacy propagation standard (B3) and missing sampling strategy.

Key Concerns Identified:

  1. B3 Propagation is Anti-Pattern: Legacy Zipkin-era standard; W3C Trace Context is the modern standard
  2. No Sampling Strategy: 100% capture (AlwaysOn) is financially and computationally dangerous
  3. Missing Instrumentation: Celery, Redis, boto3, Kubernetes clients not covered
  4. Frontend Noise Risk: Frontend tracing can be extremely noisy (every click/scroll)

Required Modifications:

  1. Replace B3 with W3C Trace Context:
    • Primary: W3C traceparent
    • Fallback: Composite propagator accepting both W3C and B3 for legacy interop
  2. Define Sampling Strategy:
    • Head Sampling: ParentBasedTraceIdRatioBased (e.g., 10% healthy, 100% errors)
    • Tail Sampling: Configure OTel Collector to keep traces >500ms or with errors
    • Enforce BatchSpanProcessor with tuned queue sizes
  3. Expand Instrumentation:
    • Background workers: Celery, RQ, Kafka
    • Infrastructure: boto3, Google Cloud, Kubernetes clients
    • Caching: Redis hit/miss visibility
  4. ML Workloads: Custom spans must capture gen_ai.token.input_count, model.name
  5. Architecture: Require OpenTelemetry Collector as intermediary gateway

Modifications Applied

  1. Updated propagation standard to W3C Trace Context
  2. Documented hybrid sampling strategy (head + tail)
  3. Added Celery, Redis, infrastructure SDK instrumentation requirements
  4. Documented OTel Collector deployment pattern
  5. Added frontend tracing noise reduction guidelines

Council Ranking

  • gpt-5.2: Best Response (sampling strategy)
  • gemini-3-pro: Strong (propagation analysis)
  • grok-4.1: Good (instrumentation gaps)

Operational Guidelines (APPROVED_WITH_MODS)

Sampling Strategy Documentation

Head Sampling (Application Level):

# backend/core/telemetry/sampling.py
from opentelemetry.sdk.trace.sampling import (
    ALWAYS_ON,
    Decision,
    ParentBasedTraceIdRatio,
    Sampler,
    SamplingResult,
    TraceIdRatioBased,
)

def get_sampler(environment: str) -> Sampler:
    """Get environment-appropriate sampler."""
    if environment == "development":
        return ALWAYS_ON  # 100% for debugging
    elif environment == "staging":
        return ParentBasedTraceIdRatio(0.5)  # 50%
    else:  # production
        return ParentBasedTraceIdRatio(0.1)  # 10% baseline

# Configure with custom logic for important traces
class AdaptiveSampler(Sampler):
    """Sample 100% of errors, slow requests, and critical paths."""

    def __init__(self, base_ratio: float = 0.1):
        self.base_ratio = base_ratio
        self.base_sampler = TraceIdRatioBased(base_ratio)

    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        attributes = attributes or {}

        # Always sample errors
        if attributes.get("error", False):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)

        # Always sample slow operations
        if attributes.get("http.duration_ms", 0) > 500:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)

        # Always sample critical paths
        critical_paths = ["/api/v1/auth", "/api/v1/slos", "/health"]
        if any(path in name for path in critical_paths):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)

        # Use base ratio for everything else
        return self.base_sampler.should_sample(
            parent_context, trace_id, name, kind, attributes, links, trace_state
        )

    def get_description(self) -> str:
        return f"AdaptiveSampler{{base_ratio={self.base_ratio}}}"
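The ratio half of head sampling can be illustrated without the SDK. This sketch mirrors, approximately, how a trace-id-ratio sampler derives a deterministic yes/no from the trace id itself, which is what lets every service that sees the same trace agree on the decision:

```python
TRACE_ID_LIMIT = (1 << 64) - 1  # the sampler looks at the low 64 bits only


def ratio_sampled(trace_id: int, ratio: float) -> bool:
    """Deterministic head-sampling decision: compare the low 64 bits of the
    trace id against a bound derived from the ratio. Same trace id + same
    ratio always yields the same answer, on every service."""
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound


# A ratio of 1.0 keeps every trace; 0.0 keeps none
print(ratio_sampled(0x0AF7651916CD43DD8448EB211C80319C, 1.0))  # -> True
print(ratio_sampled(0x0AF7651916CD43DD8448EB211C80319C, 0.0))  # -> False
```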

Tail Sampling (Collector Level):

# otel-collector-config.yaml
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      # Keep all error traces
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]

      # Keep slow traces (>500ms)
      - name: latency-policy
        type: latency
        latency:
          threshold_ms: 500

      # Keep traces from critical services
      - name: critical-services
        type: string_attribute
        string_attribute:
          key: service.name
          values: [ops-auth, ops-slo-alerting]

      # Probabilistic sampling for remainder
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

Sampling Budget:

Environment | Head Sample Rate | Tail Sample Rate   | Est. Traces/Day
Development | 100%             | N/A                | ~10K
Staging     | 50%              | Keep errors + slow | ~100K
Production  | 10% + smart      | Keep errors + slow | ~500K
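The budget figures above follow from simple arithmetic: requests/day times the head rate, before tail policies add back errors and slow traces. The request volumes here are illustrative assumptions, not measured figures:

```python
def estimated_traces_per_day(requests_per_day: int, head_rate: float) -> int:
    """Head-sampled trace volume, before tail-sampling adjustments."""
    return int(requests_per_day * head_rate)


# Illustrative only: the table's ~500K production figure corresponds to
# roughly 5M requests/day at the 10% head rate
print(estimated_traces_per_day(5_000_000, 0.10))  # -> 500000
```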

Trace Context Propagation

W3C Trace Context (Primary):

# backend/core/telemetry/propagation.py
from opentelemetry.propagate import set_global_textmap
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
from opentelemetry.propagators.b3 import B3MultiFormat

# Configure composite propagator (W3C primary, B3 fallback)
propagator = CompositePropagator([
    TraceContextTextMapPropagator(),  # W3C traceparent (primary)
    B3MultiFormat(),                  # Legacy B3 (fallback for older services)
])
set_global_textmap(propagator)

HTTP Header Propagation:

# W3C Trace Context headers (W3C Recommendation)
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE,rojo=00f067aa0ba902b7

# traceparent format: version-trace_id-span_id-trace_flags
#   version:     00 (current version)
#   trace_id:    32 hex chars (16 bytes)
#   span_id:     16 hex chars (8 bytes)
#   trace_flags: 01 = sampled, 00 = not sampled
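The field widths above can be checked mechanically; a minimal pure-Python parser (a sketch for illustration, not the SDK's implementation) that splits a traceparent value into its four dash-separated fields:

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields, validating
    the field widths: 2 + 32 + 16 + 2 hex characters."""
    version, trace_id, span_id, flags = header.split("-")
    assert len(version) == 2 and len(trace_id) == 32
    assert len(span_id) == 16 and len(flags) == 2
    return {
        "version": version,
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": flags == "01",
    }


parsed = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
print(parsed["trace_id"])  # -> 0af7651916cd43dd8448eb211c80319c
print(parsed["sampled"])   # -> True
```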

Cross-Service Propagation:

# FastAPI middleware for trace extraction
from fastapi import Request
from opentelemetry import trace
from opentelemetry.propagate import extract
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer(__name__)

@app.middleware("http")
async def trace_middleware(request: Request, call_next):
    # Extract trace context from incoming headers
    context = extract(request.headers)

    # Create span with extracted context
    with tracer.start_as_current_span(
        f"{request.method} {request.url.path}",
        context=context,
        kind=SpanKind.SERVER,
    ) as span:
        # Add request attributes
        span.set_attribute("http.method", request.method)
        span.set_attribute("http.url", str(request.url))

        response = await call_next(request)

        span.set_attribute("http.status_code", response.status_code)
        return response

Frontend-to-Backend Propagation:

// frontend/src/telemetry/propagation.ts
import { W3CTraceContextPropagator } from '@opentelemetry/core';
import { context, propagation } from '@opentelemetry/api';

// Register the W3C propagator globally so inject() uses it
propagation.setGlobalPropagator(new W3CTraceContextPropagator());

// Inject trace context into fetch headers
async function fetchWithTracing(url: string, options: RequestInit = {}) {
  const headers = new Headers(options.headers);

  // Inject current context into headers
  propagation.inject(context.active(), headers, {
    set: (carrier, key, value) => carrier.set(key, value),
  });

  return fetch(url, { ...options, headers });
}

Propagation Validation:

# Test trace context propagation (tracer and test client assumed as fixtures)
def test_trace_propagation():
    """Verify trace context flows through service boundaries."""
    # Start a trace
    with tracer.start_as_current_span("parent") as parent:
        trace_id = parent.get_span_context().trace_id

        # Make downstream request inside the active span
        response = client.get("/api/v1/requirements")

        # Verify same trace_id in response headers
        assert response.headers.get("X-Trace-Id") == format(trace_id, "032x")
