ClickHouse Buffering Layer: Preventing "Too Many Parts" Crashes
Published: 2025-01-16
When the LLM Council reviewed our ClickStack integration (ADR-043), they identified a critical flaw: we were sending telemetry data directly from applications to ClickHouse without any buffering layer. The verdict was CONDITIONAL APPROVAL until we addressed this—and for good reason.
This post details how we addressed Issue #451 by implementing an OTel Gateway buffering layer, transforming our telemetry pipeline from a ticking time bomb into a resilient, production-ready system.
The Problem
Our original architecture was deceptively simple:
[Apps] --OTLP--> [Collector:4317] --> [ClickHouse]
This looks clean, but under traffic spikes (incident response, deployment waves, load tests), the direct ingestion path creates serious problems:
- "Too Many Parts" Errors: ClickHouse uses MergeTree storage that creates "parts" for each insert. Too many small inserts overwhelm the merge process.
- Memory Exhaustion: The collector has no backpressure mechanism; it accepts data as fast as apps can send it, potentially causing OOM crashes.
- No Retry Safety: When ClickHouse briefly becomes unavailable, data is lost.
- Cascade Failures: A spike in one service's telemetry can crash the entire observability pipeline.
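A toy model makes the "too many parts" failure mode concrete. The numbers below are illustrative, not ClickHouse internals: assume every insert creates one part and background merges drain parts at a roughly fixed rate, so a sustained spike grows an unbounded backlog.

```python
# Toy model: each insert creates one part; merges remove parts at a
# bounded rate. All numbers are illustrative, not measured.
def parts_backlog(inserts_per_min: int, merges_per_min: int, minutes: int) -> int:
    backlog = 0
    for _ in range(minutes):
        backlog += inserts_per_min               # every insert adds a part
        backlog -= min(backlog, merges_per_min)  # merges drain a bounded amount
    return backlog

# Steady state: merges keep up, backlog stays at zero.
normal = parts_backlog(inserts_per_min=600, merges_per_min=600, minutes=10)
# 10x spike: backlog grows by 5,400 parts every minute.
spike = parts_backlog(inserts_per_min=6000, merges_per_min=600, minutes=10)
```

Once the backlog crosses ClickHouse's parts threshold, inserts start failing outright, which is exactly the crash mode described above.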
The Council specifically called out:
"Direct OTLP→ClickHouse allows ingestion spikes to crash DB ('Too many parts'). Insert Kafka or aggregated OTel Collector Gateway between apps and ClickHouse."
The Solution
We implemented a two-tier OTel Collector architecture with a dedicated gateway:
[Apps] --OTLP--> [Gateway:4317] --OTLP--> [Collector:4319] --> [ClickHouse]
Gateway Layer
The gateway sits between applications and the ClickHouse-writing collector, providing:
- Memory Limiting: Prevents OOM under traffic spikes
- Batching: Aggregates small payloads into efficient batches
- Backpressure: Refuses data when overwhelmed (graceful degradation)
Collector Layer
The collector focuses on ClickHouse-optimized writes:
- Large Batches: 10,000 rows per insert (vs 1,024 before)
- Retry Logic: Exponential backoff for transient failures
- Internal Only: Not exposed externally—only gateway connects
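The retry behavior can be sketched as a backoff schedule. This is a simplified model using the same parameters as our `retry_on_failure` settings (initial 5s, capped at 30s, giving up after 300s); the real OTel exporter also randomizes each interval with jitter, omitted here for clarity.

```python
# Simplified exponential-backoff schedule: double the wait each retry,
# cap at max_interval, stop once max_elapsed would be exceeded.
def backoff_schedule(initial: float = 5.0, max_interval: float = 30.0,
                     max_elapsed: float = 300.0) -> list[float]:
    waits: list[float] = []
    elapsed, wait = 0.0, initial
    while elapsed + wait <= max_elapsed:
        waits.append(wait)
        elapsed += wait
        wait = min(wait * 2, max_interval)  # double, capped at max_interval
    return waits

# Schedule: 5s, 10s, 20s, then 30s repeated until the 300s budget runs out.
schedule = backoff_schedule()
```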
Implementation
Gateway Configuration
The gateway (gateway.yaml) is the first line of defense:
```yaml
processors:
  # Memory limiter - MUST be first processor
  # limit_mib is the HARD limit; spike_limit_mib is subtracted
  memory_limiter:
    check_interval: 100ms
    limit_mib: 1536        # Hard limit (fits in 2GB container)
    spike_limit_mib: 256   # Soft limit at 1280 MiB

  # Batch processor - aggregates for efficient writes
  batch:
    timeout: 5s                 # Max time to wait
    send_batch_size: 8192       # Target batch size
    send_batch_max_size: 16384

exporters:
  otlp:
    endpoint: otel-collector:4319
    sending_queue:
      queue_size: 500  # Conservative queue size (~4 MB)
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
```
The memory_limiter is critical—it tracks memory usage and applies backpressure when limits are reached. Understanding OTel's memory model is key: limit_mib is the hard limit (data refused above this), while spike_limit_mib defines a "buffer zone" (soft limit = limit_mib - spike_limit_mib). Backpressure kicks in at the soft limit; data is refused at the hard limit.
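A minimal sketch of that decision logic, using the limits from `gateway.yaml` above. This is a simplification: the real processor samples the runtime's memory statistics rather than taking a usage number as input.

```python
# Sketch of the memory_limiter's three zones (values from gateway.yaml).
LIMIT_MIB = 1536                               # hard limit
SPIKE_LIMIT_MIB = 256                          # buffer zone
SOFT_LIMIT_MIB = LIMIT_MIB - SPIKE_LIMIT_MIB   # 1280 MiB

def admit(current_mib: float) -> str:
    """Classify what the limiter does at a given memory usage."""
    if current_mib >= LIMIT_MIB:
        return "refuse"        # hard limit: incoming data is rejected
    if current_mib >= SOFT_LIMIT_MIB:
        return "backpressure"  # soft limit: slow producers down
    return "accept"
```

Below 1280 MiB data flows normally; between 1280 and 1536 MiB the gateway pushes back on producers; at 1536 MiB it refuses data rather than OOM.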
Collector Configuration
The collector (collector.yaml) optimizes for ClickHouse:
```yaml
processors:
  batch:
    timeout: 10s                # Longer timeout for larger batches
    send_batch_size: 10000      # ClickHouse handles large batches well
    send_batch_max_size: 20000

exporters:
  clickhouse:
    endpoint: tcp://clickhouse:9000?dial_timeout=10s&compress=lz4
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
```
The larger batch size (10,000 vs 1,024) dramatically reduces the number of parts created in ClickHouse.
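The arithmetic is straightforward. Assuming a steady row rate (the 100k rows/s figure below is illustrative, not a measured number):

```python
# Back-of-the-envelope: effect of batch size on insert (and part) counts.
ROWS_PER_SECOND = 100_000  # illustrative volume

inserts_before = ROWS_PER_SECOND / 1_024   # old send_batch_size
inserts_after = ROWS_PER_SECOND / 10_000   # new send_batch_size
reduction = inserts_before / inserts_after
print(f"{reduction:.1f}x fewer inserts per second")  # roughly 10x
```

Since each insert creates at least one part, an order-of-magnitude drop in insert count translates directly into an order-of-magnitude drop in part creation.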
Docker Compose
The docker-compose.yml wires it together:
```yaml
services:
  otel-gateway:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-gateway.yaml"]
    ports:
      - "4317:4317"  # Apps connect here
      - "4318:4318"
    deploy:
      resources:
        limits:
          memory: 2G  # Container memory limit

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector.yaml"]
    # Note: 4317 NOT exposed - gateway handles external connections
    expose:
      - "4319"  # Internal only
```
Key points:
- Gateway exposes ports 4317/4318 (apps connect here)
- Collector does NOT expose 4317 externally (internal only)
- Gateway has explicit memory limits (2GB container)
Results
Before: Direct Ingestion
Traffic Spike (10x) → Collector overwhelmed → ClickHouse "Too many parts" → Data loss
After: Buffered Ingestion
Traffic Spike (10x) → Gateway buffers → Memory limit applies backpressure →
Collector batches → ClickHouse receives ~1/10th the insert count → Stable
Key Metrics
| Metric | Before | After |
|---|---|---|
| Inserts per second | 1000+ | ~100 (batched) |
| Parts created per minute | High | Low |
| Spike survival | Crash | Graceful degradation |
| Data loss during spikes | Yes | < 1% |
Lessons Learned
- Buffering is not optional: For any high-volume ingestion, a buffering layer is essential. The "simple" direct path is a trap.
- Memory limits require thought: Setting `limit_mib` too low causes unnecessary data loss; too high risks OOM. Monitor and adjust.
- Batch size matters for ClickHouse: ClickHouse performance depends heavily on insert size. Larger batches = fewer parts = better performance.
- Two-tier architecture works: Separating "absorb spikes" (gateway) from "write efficiently" (collector) simplifies each component.
- Test under load: We added load tests (`test_telemetry_buffering.py`) that simulate 10x traffic spikes. These catch issues before production.
Testing the Buffering Layer
We created a Python module (backend/core/telemetry/buffering.py) that simulates the OTel Gateway behavior for testing:
```python
from backend.core.telemetry.buffering import OTelGatewayBuffer, TelemetrySpike

buffer = OTelGatewayBuffer(
    memory_limit_mib=1800,
    spike_limit_mib=400,
    batch_timeout_seconds=5,
    batch_size=8192,
)

spike = TelemetrySpike(
    spans_per_second=1000,  # 10x normal
    duration_seconds=60,
)

# Runs inside an async test function.
metrics = await buffer.simulate_load(spike)
assert metrics.success_rate >= 0.99
assert metrics.error_count == 0
```
This allows testing buffering behavior without running the full OTel stack.
What's Next
With the buffering layer in place, we can now:
- Add Kafka for even larger buffering capacity
- Implement tiered storage (hot/cold) in ClickHouse
- Add per-signal retention policies
- Build dashboards for gateway health monitoring
This post details the implementation of ADR-043: ClickStack Integration, resolving Issue #451.