ClickHouse Buffering Layer: Preventing "Too Many Parts" Crashes
Published: 2025-01-16
When the LLM Council reviewed our ClickStack integration (ADR-043), they identified a critical flaw: we were sending telemetry data directly from applications to ClickHouse without any buffering layer. The verdict was CONDITIONAL APPROVAL until we addressed this—and for good reason.
This post details how we addressed Issue #451 by implementing an OTel Gateway buffering layer, transforming our telemetry pipeline from a ticking time bomb into a resilient, production-ready system.
The Problem
Our original architecture was deceptively simple:
[Apps] --OTLP--> [Collector:4317] --> [ClickHouse]
This looks clean, but under traffic spikes (incident response, deployment waves, load tests), the direct ingestion path creates serious problems:
- "Too Many Parts" Errors: ClickHouse uses MergeTree storage that creates "parts" for each insert. Too many small inserts overwhelm the merge process.
- Memory Exhaustion: The collector has no backpressure mechanism; it accepts data as fast as apps can send it, potentially causing OOM crashes.
- No Retry Safety: When ClickHouse briefly becomes unavailable, data is lost.
- Cascade Failures: A spike in one service's telemetry can crash the entire observability pipeline.
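A toy model makes the "too many parts" failure mode concrete. The numbers below are illustrative, not ClickHouse internals: assume every insert creates one part and background merges drain parts at a roughly fixed rate, so a sustained spike grows an unbounded backlog.

```python
# Toy model: each insert creates one part; merges remove parts at a
# bounded rate. All numbers are illustrative, not measured.
def parts_backlog(inserts_per_min: int, merges_per_min: int, minutes: int) -> int:
    backlog = 0
    for _ in range(minutes):
        backlog += inserts_per_min               # every insert adds a part
        backlog -= min(backlog, merges_per_min)  # merges drain a bounded amount
    return backlog

# Steady state: merges keep up, backlog stays at zero.
normal = parts_backlog(inserts_per_min=600, merges_per_min=600, minutes=10)
# 10x spike: backlog grows by 5,400 parts every minute.
spike = parts_backlog(inserts_per_min=6000, merges_per_min=600, minutes=10)
```

Once the backlog crosses ClickHouse's parts threshold, inserts start failing outright, which is exactly the crash mode described above.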
The Council specifically called out:
"Direct OTLP→ClickHouse allows ingestion spikes to crash DB ('Too many parts'). Insert Kafka or aggregated OTel Collector Gateway between apps and ClickHouse."
The Solution
We implemented a two-tier OTel Collector architecture with a dedicated gateway:
[Apps] --OTLP--> [Gateway:4317] --OTLP--> [Collector:4319] --> [ClickHouse]
Gateway Layer
The gateway sits between applications and the ClickHouse-writing collector, providing:
- Memory Limiting: Prevents OOM under traffic spikes
- Batching: Aggregates small payloads into efficient batches
- Backpressure: Refuses data when overwhelmed (graceful degradation)
Collector Layer
The collector focuses on ClickHouse-optimized writes:
- Large Batches: 10,000 rows per insert (vs 1,024 before)
- Retry Logic: Exponential backoff for transient failures
- Internal Only: Not exposed externally—only gateway connects
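The retry behavior can be sketched as a backoff schedule. This is a simplified model using the same parameters as our `retry_on_failure` settings (initial 5s, capped at 30s, giving up after 300s); the real OTel exporter also randomizes each interval with jitter, omitted here for clarity.

```python
# Simplified exponential-backoff schedule: double the wait each retry,
# cap at max_interval, stop once max_elapsed would be exceeded.
def backoff_schedule(initial: float = 5.0, max_interval: float = 30.0,
                     max_elapsed: float = 300.0) -> list[float]:
    waits: list[float] = []
    elapsed, wait = 0.0, initial
    while elapsed + wait <= max_elapsed:
        waits.append(wait)
        elapsed += wait
        wait = min(wait * 2, max_interval)  # double, capped at max_interval
    return waits

# Schedule: 5s, 10s, 20s, then 30s repeated until the 300s budget runs out.
schedule = backoff_schedule()
```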
Implementation
Gateway Configuration
The gateway (gateway.yaml) is the first line of defense:
```yaml
processors:
  # Memory limiter - MUST be first processor
  # limit_mib is the HARD limit; spike_limit_mib is subtracted
  memory_limiter:
    check_interval: 100ms
    limit_mib: 1536        # Hard limit (fits in 2GB container)
    spike_limit_mib: 256   # Soft limit at 1280 MiB

  # Batch processor - aggregates for efficient writes
  batch:
    timeout: 5s                 # Max time to wait
    send_batch_size: 8192       # Target batch size
    send_batch_max_size: 16384

exporters:
  otlp:
    endpoint: otel-collector:4319
    sending_queue:
      queue_size: 500  # Conservative queue size (~4 MB)
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
```
The memory_limiter is critical—it tracks memory usage and applies backpressure when limits are reached. Understanding OTel's memory model is key: limit_mib is the hard limit (data refused above this), while spike_limit_mib defines a "buffer zone" (soft limit = limit_mib - spike_limit_mib). Backpressure kicks in at the soft limit; data is refused at the hard limit.
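A minimal sketch of that decision logic, using the limits from `gateway.yaml` above. This is a simplification: the real processor samples the runtime's memory statistics rather than taking a usage number as input.

```python
# Sketch of the memory_limiter's three zones (values from gateway.yaml).
LIMIT_MIB = 1536                               # hard limit
SPIKE_LIMIT_MIB = 256                          # buffer zone
SOFT_LIMIT_MIB = LIMIT_MIB - SPIKE_LIMIT_MIB   # 1280 MiB

def admit(current_mib: float) -> str:
    """Classify what the limiter does at a given memory usage."""
    if current_mib >= LIMIT_MIB:
        return "refuse"        # hard limit: incoming data is rejected
    if current_mib >= SOFT_LIMIT_MIB:
        return "backpressure"  # soft limit: slow producers down
    return "accept"
```

Below 1280 MiB data flows normally; between 1280 and 1536 MiB the gateway pushes back on producers; at 1536 MiB it refuses data rather than OOM.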
Collector Configuration
The collector (collector.yaml) optimizes for ClickHouse:
```yaml
processors:
  batch:
    timeout: 10s                # Longer timeout for larger batches
    send_batch_size: 10000      # ClickHouse handles large batches well
    send_batch_max_size: 20000

exporters:
  clickhouse:
    endpoint: tcp://clickhouse:9000?dial_timeout=10s&compress=lz4
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
```
The larger batch size (10,000 vs 1,024) dramatically reduces the number of parts created in ClickHouse.
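The arithmetic is straightforward. Assuming a steady row rate (the 100k rows/s figure below is illustrative, not a measured number):

```python
# Back-of-the-envelope: effect of batch size on insert (and part) counts.
ROWS_PER_SECOND = 100_000  # illustrative volume

inserts_before = ROWS_PER_SECOND / 1_024   # old send_batch_size
inserts_after = ROWS_PER_SECOND / 10_000   # new send_batch_size
reduction = inserts_before / inserts_after
print(f"{reduction:.1f}x fewer inserts per second")  # roughly 10x
```

Since each insert creates at least one part, an order-of-magnitude drop in insert count translates directly into an order-of-magnitude drop in part creation.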
Docker Compose
The docker-compose.yml wires it together:
```yaml
services:
  otel-gateway:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-gateway.yaml"]
    ports:
      - "4317:4317"  # Apps connect here
      - "4318:4318"
    deploy:
      resources:
        limits:
          memory: 2G  # Container memory limit

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector.yaml"]
    # Note: 4317 NOT exposed - gateway handles external connections
    expose:
      - "4319"  # Internal only
```
Key points:
- Gateway exposes ports 4317/4318 (apps connect here)
- Collector does NOT expose 4317 externally (internal only)
- Gateway has explicit memory limits (2GB container)
Results
Before: Direct Ingestion
Traffic Spike (10x) → Collector overwhelmed → ClickHouse "Too many parts" → Data loss
After: Buffered Ingestion
Traffic Spike (10x) → Gateway buffers → Memory limit applies backpressure →
Collector batches → ClickHouse receives ~1/10th the insert count → Stable
Key Metrics
| Metric | Before | After |
|---|---|---|
| Inserts per second | 1000+ | ~100 (batched) |
| Parts created per minute | High | Low |
| Spike survival | Crash | Graceful degradation |
| Data loss during spikes | Yes | < 1% |
Lessons Learned
- Buffering is not optional: For any high-volume ingestion, a buffering layer is essential. The "simple" direct path is a trap.
- Memory limits require thought: Setting `limit_mib` too low causes unnecessary data loss; too high risks OOM. Monitor and adjust.
- Batch size matters for ClickHouse: ClickHouse performance depends heavily on insert size. Larger batches = fewer parts = better performance.
- Two-tier architecture works: Separating "absorb spikes" (gateway) from "write efficiently" (collector) simplifies each component.
- Test under load: We added load tests (`test_telemetry_buffering.py`) that simulate 10x traffic spikes. These catch issues before production.
Testing the Buffering Layer
We created a Python module (backend/core/telemetry/buffering.py) that simulates the OTel Gateway behavior for testing:
```python
from backend.core.telemetry.buffering import OTelGatewayBuffer, TelemetrySpike

buffer = OTelGatewayBuffer(
    memory_limit_mib=1800,
    spike_limit_mib=400,
    batch_timeout_seconds=5,
    batch_size=8192,
)

spike = TelemetrySpike(
    spans_per_second=1000,  # 10x normal
    duration_seconds=60,
)

# Runs inside an async test function.
metrics = await buffer.simulate_load(spike)
assert metrics.success_rate >= 0.99
assert metrics.error_count == 0
```
This allows testing buffering behavior without running the full OTel stack.
What's Next
With the buffering layer in place, we can now:
- Add Kafka for even larger buffering capacity
- Implement tiered storage (hot/cold) in ClickHouse
- Add per-signal retention policies
- Build dashboards for gateway health monitoring
This post details the implementation of ADR-043: ClickStack Integration, resolving Issue #451.