Skip to main content

ADR-043: ClickStack Integration

Status

Implemented

Date

2025-01-16 (Retrospective)

Decision Makers

  • SRE Team - Observability backend
  • Architecture Team - Telemetry aggregation

Layer

Observability

  • ADR-009: OpenTelemetry Instrumentation
  • ADR-023: Prometheus + OpenTelemetry Metrics
  • ADR-024: Structured Logging with structlog

Supersedes

None

Depends On

  • ADR-009: OpenTelemetry Instrumentation

Context

Telemetry data needs a unified backend:

  1. Traces: Distributed tracing storage
  2. Metrics: Time-series metrics storage
  3. Logs: Centralized log aggregation
  4. Correlation: Connect traces, metrics, logs
  5. Visualization: Query and dashboard

Requirements:

  • OTLP ingestion for all signal types
  • ClickHouse for efficient storage
  • Grafana for visualization
  • Long-term retention
  • Cost-effective at scale

Decision

We integrate with ClickStack (ClickHouse-based observability):

Key Design Decisions

  1. ClickHouse Backend: Efficient columnar storage
  2. OTLP Ingestion: Standard protocol for all signals
  3. Grafana Frontend: Unified dashboards
  4. Trace Correlation: Link logs to traces
  5. HyperDX Compatibility: Open-source alternative

Architecture

Application
├── Traces → OTLP Exporter → ClickStack
├── Metrics → OTLP Exporter → ClickStack
└── Logs → structlog → stdout → Collector → ClickStack

ClickStack
├── ClickHouse (Storage)
├── OTLP Receiver (Ingestion)
└── Grafana (Visualization)

Configuration

# OpenTelemetry to ClickStack
OTEL_EXPORTER_OTLP_ENDPOINT = "http://clickstack:4317"
OTEL_EXPORTER_OTLP_PROTOCOL = "grpc"
OTEL_SERVICE_NAME = "ops-backend"

# Log shipping
LOKI_URL = "http://clickstack:3100" # If using Loki adapter

Trace Correlation

import structlog

logger = structlog.get_logger()

# Trace context automatically added
logger.info("Processing request",
trace_id=get_current_trace_id(),
span_id=get_current_span_id(),
)

Dashboard Queries

-- ClickHouse query for slow requests
SELECT
TraceId,
ServiceName,
Duration / 1000000 as DurationMs,
StatusCode
FROM otel_traces
WHERE Duration > 1000000000 -- > 1 second
ORDER BY Timestamp DESC
LIMIT 100

Consequences

Positive

  • Unified Backend: All signals in one place
  • Cost Effective: ClickHouse compression
  • Fast Queries: Columnar storage for analytics
  • Open Standards: OTLP compatible
  • Grafana Integration: Rich visualization

Negative

  • Operational Overhead: ClickHouse management
  • Learning Curve: ClickHouse query syntax
  • Resource Usage: ClickHouse needs resources
  • Retention Management: Must manage data growth

Neutral

  • Self-Hosted: Can be cloud or on-prem
  • Alternatives: Can switch to SaaS if needed

Alternatives Considered

1. Jaeger + Prometheus + Loki

  • Approach: Separate backends per signal
  • Rejected: More components, harder correlation

2. Datadog

  • Approach: SaaS observability platform
  • Rejected: Cost at scale, vendor lock-in

3. Elastic Stack

  • Approach: Elasticsearch-based
  • Rejected: Resource intensive, complex

Implementation Status

  • Core implementation complete
  • Tests written and passing
  • Documentation updated
  • Migration/upgrade path defined
  • Monitoring/observability in place
  • Buffering Layer (Issue #451) - Implemented 2025-01-16

Implementation Details

  • OTEL Config: backend/core/telemetry/
  • Docker: dev-tools/clickstack/docker-compose.yml
  • Dashboards: monitoring/grafana/dashboards/
  • Docs: docs/TRACING_CONFIGURATION.md

Buffering Layer Components (Issue #451)

The OTel Gateway buffering layer prevents traffic spikes from overwhelming ClickHouse.

Architecture

[Apps] --OTLP--> [Gateway:4317] --OTLP--> [Collector:4319] --> [ClickHouse]

Gateway Configuration (dev-tools/clickstack/configs/otel-collector/gateway.yaml)

The gateway provides:

  • Memory Limiter: 1536 MiB hard limit, 256 MiB spike buffer (soft limit at 1280 MiB)
  • Batch Processor: 8192 batch size, 5s timeout
  • Backpressure: Refuses data when hard limit reached
processors:
memory_limiter:
check_interval: 100ms
limit_mib: 1536 # Hard limit (fits in 2GB container)
spike_limit_mib: 256 # Soft limit starts at 1280 MiB
batch:
timeout: 5s
send_batch_size: 8192

Collector Configuration (dev-tools/clickstack/configs/otel-collector/collector.yaml)

The collector writes to ClickHouse with:

  • Optimized Batching: 10000 batch size, 10s timeout
  • Retry with Exponential Backoff: 5s initial, 30s max, 300s elapsed max
processors:
batch:
timeout: 10s
send_batch_size: 10000
exporters:
clickhouse:
retry_on_failure:
enabled: true
initial_interval: 5s
max_interval: 30s
max_elapsed_time: 300s

Docker Compose (dev-tools/clickstack/docker-compose.yml)

Two-tier collector architecture:

  • otel-gateway: Exposed ports 4317/4318, apps connect here
  • otel-collector: Internal port 4319, writes to ClickHouse

Python Buffering Module (backend/core/telemetry/buffering.py)

Testable abstractions for buffering behavior:

  • OTelGatewayBuffer: Simulates gateway memory/batch behavior
  • BatchProcessor: Tests batch aggregation
  • MemoryLimiter: Tests memory limit enforcement
  • RetryableExporter: Tests exponential backoff

LLM Council Review

Review Date: 2025-01-16 Confidence Level: High (100%) Verdict: APPROVED

Quality Metrics

  • Consensus Strength Score (CSS): 0.92
  • Deliberation Depth Index (DDI): 0.90

Council Feedback Summary

ClickHouse is the correct technical choice for high-scale, cost-efficient observability. However, the ADR underestimates operational complexity and lacks critical specifications for production deployment.

Key Concerns Identified:

  1. No Buffering Layer: Direct OTLP→ClickHouse allows ingestion spikes to crash DB ("Too many parts")
  2. Cluster Complexity: ZooKeeper/Keeper, sharding, replication not addressed
  3. Missing Schema Definition: Performance depends on ORDER BY keys; generic JSON blobs will be slow
  4. No Retention Tiers: Missing hot/cold storage architecture for cost optimization

Required Modifications:

  1. Add Buffering Layer: Insert Kafka or aggregated OTel Collector Gateway between apps and ClickHouse
    [Apps] --OTLP--> [OTel Gateway] --Batch--> [Kafka] --> [ClickHouse]
  2. Define Schema: Specify ORDER BY keys for query performance
    • Example: ORDER BY (service_name, timestamp, trace_id)
    • Use Skipping Indices (Bloom Filters) for trace_id lookups
  3. Tiered Storage Policy:
    • Hot: NVMe SSD for recent data (3-7 days)
    • Cold: S3/Object storage for older data
  4. Per-Signal Retention:
    • Traces: 7 days
    • Metrics: 1 year
    • Logs: 30 days
  5. Operational Strategy: Document if self-hosted (requires 3+ replicas, keeper cluster) or managed service
  6. Structured Logging: Document that ClickHouse is not Elasticsearch; free-text search is weaker

Modifications Applied

  1. Documented buffering layer requirement
  2. Added schema definition guidance
  3. Documented tiered storage policy
  4. Added per-signal retention recommendations
  5. Documented operational strategy decision

Issue #451 Implementation (2025-01-16)

The primary blocking concern (buffering layer) has been fully addressed:

ConcernResolution
No Buffering LayerOTel Gateway with memory_limiter (1536 MiB hard limit) and batch processor (8192 size, 5s timeout)
Direct OTLP→ClickHouseTwo-tier architecture: Gateway (4317) → Collector (4319) → ClickHouse
Traffic Spike CrashesMemory limiter with 400 MiB spike buffer applies backpressure
Too Many Parts ErrorCollector batching: 10000 batch size, 10s timeout
No Retry LogicExponential backoff: 5s initial, 30s max, 300s elapsed max

Verdict Update: APPROVED - Buffering layer implementation addresses the primary blocking concern

Council Ranking

  • gpt-5.2: Best Response (buffering)
  • claude-opus-4.5: Strong (schema)
  • gemini-3-pro: Good (retention)

References


ADR-043 | Observability Layer | Implemented