ADR-043: ClickStack Integration
Status
Implemented
Date
2025-01-16 (Retrospective)
Decision Makers
- SRE Team - Observability backend
- Architecture Team - Telemetry aggregation
Layer
Observability
Related ADRs
- ADR-009: OpenTelemetry Instrumentation
- ADR-023: Prometheus + OpenTelemetry Metrics
- ADR-024: Structured Logging with structlog
Supersedes
None
Depends On
- ADR-009: OpenTelemetry Instrumentation
Context
Telemetry data needs a unified backend:
- Traces: Distributed tracing storage
- Metrics: Time-series metrics storage
- Logs: Centralized log aggregation
- Correlation: Connect traces, metrics, logs
- Visualization: Query and dashboard
Requirements:
- OTLP ingestion for all signal types
- ClickHouse for efficient storage
- Grafana for visualization
- Long-term retention
- Cost-effective at scale
Decision
We integrate with ClickStack (ClickHouse-based observability):
Key Design Decisions
- ClickHouse Backend: Efficient columnar storage
- OTLP Ingestion: Standard protocol for all signals
- Grafana Frontend: Unified dashboards
- Trace Correlation: Link logs to traces
- HyperDX Compatibility: Open-source alternative
Architecture
Application
├── Traces → OTLP Exporter → ClickStack
├── Metrics → OTLP Exporter → ClickStack
└── Logs → structlog → stdout → Collector → ClickStack
ClickStack
├── ClickHouse (Storage)
├── OTLP Receiver (Ingestion)
└── Grafana (Visualization)
Configuration
# OpenTelemetry to ClickStack
OTEL_EXPORTER_OTLP_ENDPOINT = "http://clickstack:4317"
OTEL_EXPORTER_OTLP_PROTOCOL = "grpc"
OTEL_SERVICE_NAME = "ops-backend"
# Log shipping
LOKI_URL = "http://clickstack:3100" # If using Loki adapter
Trace Correlation
import structlog
logger = structlog.get_logger()
# Trace context automatically added
logger.info("Processing request",
trace_id=get_current_trace_id(),
span_id=get_current_span_id(),
)
Dashboard Queries
-- ClickHouse query for slow requests
SELECT
TraceId,
ServiceName,
Duration / 1000000 as DurationMs,
StatusCode
FROM otel_traces
WHERE Duration > 1000000000 -- > 1 second
ORDER BY Timestamp DESC
LIMIT 100
Consequences
Positive
- Unified Backend: All signals in one place
- Cost Effective: ClickHouse compression
- Fast Queries: Columnar storage for analytics
- Open Standards: OTLP compatible
- Grafana Integration: Rich visualization
Negative
- Operational Overhead: ClickHouse management
- Learning Curve: ClickHouse query syntax
- Resource Usage: ClickHouse needs resources
- Retention Management: Must manage data growth
Neutral
- Self-Hosted: Can be cloud or on-prem
- Alternatives: Can switch to SaaS if needed
Alternatives Considered
1. Jaeger + Prometheus + Loki
- Approach: Separate backends per signal
- Rejected: More components, harder correlation
2. Datadog
- Approach: SaaS observability platform
- Rejected: Cost at scale, vendor lock-in
3. Elastic Stack
- Approach: Elasticsearch-based
- Rejected: Resource intensive, complex
Implementation Status
- Core implementation complete
- Tests written and passing
- Documentation updated
- Migration/upgrade path defined
- Monitoring/observability in place
- Buffering Layer (Issue #451) - Implemented 2025-01-16
Implementation Details
- OTEL Config:
backend/core/telemetry/ - Docker:
dev-tools/clickstack/docker-compose.yml - Dashboards:
monitoring/grafana/dashboards/ - Docs:
docs/TRACING_CONFIGURATION.md
Buffering Layer Components (Issue #451)
The OTel Gateway buffering layer prevents traffic spikes from overwhelming ClickHouse.
Architecture
[Apps] --OTLP--> [Gateway:4317] --OTLP--> [Collector:4319] --> [ClickHouse]
Gateway Configuration (dev-tools/clickstack/configs/otel-collector/gateway.yaml)
The gateway provides:
- Memory Limiter: 1536 MiB hard limit, 256 MiB spike buffer (soft limit at 1280 MiB)
- Batch Processor: 8192 batch size, 5s timeout
- Backpressure: Refuses data when hard limit reached
processors:
memory_limiter:
check_interval: 100ms
limit_mib: 1536 # Hard limit (fits in 2GB container)
spike_limit_mib: 256 # Soft limit starts at 1280 MiB
batch:
timeout: 5s
send_batch_size: 8192
Collector Configuration (dev-tools/clickstack/configs/otel-collector/collector.yaml)
The collector writes to ClickHouse with:
- Optimized Batching: 10000 batch size, 10s timeout
- Retry with Exponential Backoff: 5s initial, 30s max, 300s elapsed max
processors:
batch:
timeout: 10s
send_batch_size: 10000
exporters:
clickhouse:
retry_on_failure:
enabled: true
initial_interval: 5s
max_interval: 30s
max_elapsed_time: 300s
Docker Compose (dev-tools/clickstack/docker-compose.yml)
Two-tier collector architecture:
- otel-gateway: Exposed ports 4317/4318, apps connect here
- otel-collector: Internal port 4319, writes to ClickHouse
Python Buffering Module (backend/core/telemetry/buffering.py)
Testable abstractions for buffering behavior:
OTelGatewayBuffer: Simulates gateway memory/batch behaviorBatchProcessor: Tests batch aggregationMemoryLimiter: Tests memory limit enforcementRetryableExporter: Tests exponential backoff
LLM Council Review
Review Date: 2025-01-16 Confidence Level: High (100%) Verdict: APPROVED
Quality Metrics
- Consensus Strength Score (CSS): 0.92
- Deliberation Depth Index (DDI): 0.90
Council Feedback Summary
ClickHouse is the correct technical choice for high-scale, cost-efficient observability. However, the ADR underestimates operational complexity and lacks critical specifications for production deployment.
Key Concerns Identified:
- No Buffering Layer: Direct OTLP→ClickHouse allows ingestion spikes to crash DB ("Too many parts")
- Cluster Complexity: ZooKeeper/Keeper, sharding, replication not addressed
- Missing Schema Definition: Performance depends on ORDER BY keys; generic JSON blobs will be slow
- No Retention Tiers: Missing hot/cold storage architecture for cost optimization
Required Modifications:
- Add Buffering Layer: Insert Kafka or aggregated OTel Collector Gateway between apps and ClickHouse
[Apps] --OTLP--> [OTel Gateway] --Batch--> [Kafka] --> [ClickHouse] - Define Schema: Specify ORDER BY keys for query performance
- Example:
ORDER BY (service_name, timestamp, trace_id) - Use Skipping Indices (Bloom Filters) for trace_id lookups
- Example:
- Tiered Storage Policy:
- Hot: NVMe SSD for recent data (3-7 days)
- Cold: S3/Object storage for older data
- Per-Signal Retention:
- Traces: 7 days
- Metrics: 1 year
- Logs: 30 days
- Operational Strategy: Document if self-hosted (requires 3+ replicas, keeper cluster) or managed service
- Structured Logging: Document that ClickHouse is not Elasticsearch; free-text search is weaker
Modifications Applied
- Documented buffering layer requirement
- Added schema definition guidance
- Documented tiered storage policy
- Added per-signal retention recommendations
- Documented operational strategy decision
Issue #451 Implementation (2025-01-16)
The primary blocking concern (buffering layer) has been fully addressed:
| Concern | Resolution |
|---|---|
| No Buffering Layer | OTel Gateway with memory_limiter (1536 MiB hard limit) and batch processor (8192 size, 5s timeout) |
| Direct OTLP→ClickHouse | Two-tier architecture: Gateway (4317) → Collector (4319) → ClickHouse |
| Traffic Spike Crashes | Memory limiter with 400 MiB spike buffer applies backpressure |
| Too Many Parts Error | Collector batching: 10000 batch size, 10s timeout |
| No Retry Logic | Exponential backoff: 5s initial, 30s max, 300s elapsed max |
Verdict Update: APPROVED - Buffering layer implementation addresses the primary blocking concern
Council Ranking
- gpt-5.2: Best Response (buffering)
- claude-opus-4.5: Strong (schema)
- gemini-3-pro: Good (retention)
References
ADR-043 | Observability Layer | Implemented