Common Performance Issues

Understanding common performance issues with message brokers is essential for effective load testing and optimization. This guide catalogs the most frequent performance problems you'll encounter with Kafka, RabbitMQ, and other message brokers, helping you identify bottlenecks during load testing and in production.

Why This Matters for Load Testing

When load testing message broker integrations with Speedscale, you'll want to watch for these issues as you scale traffic. Many performance problems only appear under realistic load patterns, making load testing critical for uncovering bottlenecks before they impact production.

1. Broker-Side Resource Saturation

Disk I/O bottlenecks

  • Kafka: Sequential writes and page cache behavior dominate performance. Frequent fsyncs or too little RAM for the page cache cause latency spikes.
  • RabbitMQ: Durable queues and persistent messages can become disk-bound.
  • Symptoms: Rising publish latency, consumer lag, periodic latency cliffs.

Network saturation

  • Small messages increase packet and syscall overhead.
  • Cross-AZ/region replication traffic can dominate bandwidth.
  • Symptoms: Throughput plateaus, p99 latency worsens.

CPU pressure

  • TLS, compression, routing, and message copying all consume CPU.
  • RabbitMQ topic/fanout exchanges with many bindings are CPU-intensive.

Memory pressure and GC

  • JVM (Kafka) and Erlang VM (RabbitMQ) suffer from GC pauses under memory pressure.
  • Symptoms: Latency spikes, uneven throughput.

2. Durability and Replication Tradeoffs

Ack and durability settings

  • Kafka: acks=all and a high min.insync.replicas increase safety but hurt tail latency.
  • RabbitMQ: Publisher confirms, durable queues, persistent messages add overhead.
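
To make these tradeoffs concrete, here is a minimal sketch of durability-oriented producer settings on both brokers, assuming the confluent-kafka and pika Python clients; the broker addresses, topic, and queue names are placeholders.

```python
from confluent_kafka import Producer
import pika

# Kafka: wait for acknowledgement from all in-sync replicas; idempotence prevents
# duplicates on retry. Safer, but expect higher tail latency than acks=1.
kafka_producer = Producer({
    "bootstrap.servers": "broker:9092",   # placeholder address
    "acks": "all",
    "enable.idempotence": True,
})
kafka_producer.produce("orders", key=b"order-42", value=b'{"id": 42}')
kafka_producer.flush()

# RabbitMQ: durable queue, persistent messages, and publisher confirms.
conn = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
ch = conn.channel()
ch.confirm_delivery()                                   # broker confirms each publish
ch.queue_declare(queue="orders", durable=True)
ch.basic_publish(
    exchange="",
    routing_key="orders",
    body=b'{"id": 42}',
    properties=pika.BasicProperties(delivery_mode=2),   # persist message to disk
)
conn.close()
```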

Replication catch-up

  • Lagging replicas consume bandwidth and disk I/O during recovery.

3. Partition and Queue Topology

Too few partitions or hot spots (Kafka)

  • Limits parallelism; skewed keys overload single brokers.
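
The sketch below is purely illustrative: it shows how a skewed key distribution concentrates traffic on one partition. The tenant keys and partition count are invented, and Python's built-in hash() stands in for the client's actual partitioner (Kafka's default uses murmur2).

```python
from collections import Counter

def partition_histogram(keys, num_partitions):
    """Count how many keys land on each partition under a simple hash partitioner."""
    counts = Counter(hash(k) % num_partitions for k in keys)
    return {p: counts.get(p, 0) for p in range(num_partitions)}

# Invented key distribution: one dominant tenant produces a hot partition.
keys = ["tenant-1"] * 9000 + [f"tenant-{i}" for i in range(2, 1002)]
print(partition_histogram(keys, num_partitions=6))
# One partition receives roughly 90% of the traffic, so its leader broker becomes the hot spot.
```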

Too many partitions or queues

  • Metadata, file handles, and background work explode.
  • Slower failover and rebalances.

4. Consumer-Side Backpressure

Slow consumers

  • Consumers that process slower than the production rate fall into lag spirals.

Rebalance storms (Kafka)

  • Frequent consumer group rebalances cause pauses.
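
Settings like the following often reduce rebalance churn. This sketch assumes the confluent-kafka client; the group names and timeout values are illustrative starting points, not recommendations.

```python
from confluent_kafka import Consumer

# Illustrative values; tune against your handler's processing time and deployment model.
consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "order-processors",
    # Incremental (cooperative) rebalancing avoids stop-the-world partition revocation.
    "partition.assignment.strategy": "cooperative-sticky",
    # Static membership: a restarting instance keeps its assignment instead of forcing a rebalance.
    "group.instance.id": "order-processor-1",
    # Give slow message handlers headroom before the broker evicts the consumer.
    "max.poll.interval.ms": 600000,
    "session.timeout.ms": 45000,
})
consumer.subscribe(["orders"])
```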

Prefetch and unacked tuning (RabbitMQ)

  • A high prefetch count lets one consumer hoard unacked messages; a low prefetch count caps throughput.
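
A minimal pika sketch of prefetch tuning is shown below; the prefetch_count of 50 and the queue name are placeholders to tune against your consumer's per-message processing time.

```python
import pika

def process(body):
    pass  # placeholder for your business logic

def handle(ch, method, properties, body):
    process(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)   # ack only after successful processing

conn = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
ch = conn.channel()
ch.basic_qos(prefetch_count=50)                      # cap unacked messages per consumer
ch.basic_consume(queue="orders", on_message_callback=handle)
ch.start_consuming()
```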

5. Message Shape and Client Behavior

Message size extremes

  • Large messages cause GC pressure and network fragmentation.
  • Tiny messages pay a disproportionate per-message overhead.
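
For small messages, batching and compression amortize per-message overhead. The sketch below assumes the confluent-kafka client; the linger, batch size, and compression settings are illustrative, untuned values.

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker:9092",
    "linger.ms": 10,              # wait up to 10 ms so small messages share a batch
    "batch.size": 131072,         # allow larger batches, fewer requests
    "compression.type": "lz4",    # cheap CPU cost, large win for repetitive payloads
})

for i in range(10000):
    producer.produce("metrics", value=f'{{"sensor": {i % 100}, "v": 1}}'.encode())
    producer.poll(0)              # serve delivery callbacks without blocking
producer.flush()
```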

Retry storms

  • Aggressive retries amplify load and cause cascading failures.
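
A common mitigation is capped exponential backoff with jitter, sketched below in generic Python; the base delay, cap, and attempt limit are arbitrary examples.

```python
import random
import time

def send_with_backoff(send, payload, max_attempts=5, base=0.1, cap=5.0):
    """Retry a send callable with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return send(payload)
        except Exception:
            if attempt == max_attempts - 1:
                raise                                 # give up; route to a DLQ or alert instead
            delay = min(cap, base * (2 ** attempt))   # arbitrary example parameters
            time.sleep(random.uniform(0, delay))      # jitter de-synchronizes retrying clients
```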

Connection churn

  • Frequent connection or channel creation is expensive.

6. Serialization and Schema Management

Serialization overhead

  • JSON vs Binary Formats: JSON is human-readable but verbose and slow to parse. Binary formats like Avro, Protocol Buffers, or MessagePack reduce payload size and CPU overhead.
  • Schema evolution: Incompatible schema changes break consumers. Use a schema registry (for example, Confluent Schema Registry) to manage compatibility.
  • Deserialization CPU cost: Complex schemas with deep nesting increase CPU usage per message.
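
As a quick illustration of the size difference, the sketch below compares a JSON encoding of a record with a MessagePack encoding; it assumes the msgpack package is installed, and the record shape is invented.

```python
import json
import msgpack

# Invented record shape for illustration only.
record = {"order_id": 123456, "customer_id": 987, "items": [1, 2, 3], "total_cents": 4599}

as_json = json.dumps(record).encode()
as_msgpack = msgpack.packb(record)

print(len(as_json), len(as_msgpack))  # the binary encoding is noticeably smaller
```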

Schema registry issues

  • Registry unavailability: Schema registry outages can block producers and consumers.
  • Cache tuning: Insufficient schema cache causes excessive registry lookups.
  • Version proliferation: Too many schema versions increases registry overhead.

7. Security and Authentication Overhead

TLS/SSL encryption

  • CPU overhead from encryption/decryption, especially at high throughput.
  • Certificate validation adds latency to connection establishment.
  • Symptoms: High CPU usage, reduced throughput compared to plaintext.

SASL authentication

  • Kafka: SASL/SCRAM, SASL/PLAIN, and SASL/GSSAPI (Kerberos) add authentication overhead.
  • RabbitMQ: EXTERNAL, PLAIN mechanisms impact connection setup time.
  • Token refresh and re-authentication can cause periodic latency spikes.
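
A typical encrypted, authenticated client configuration looks roughly like the sketch below (confluent-kafka assumed); the endpoint, mechanism, and CA path are placeholders, and credentials should come from a secret store.

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker:9093",          # placeholder endpoint
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "SCRAM-SHA-512",
    "sasl.username": "svc-orders",
    "sasl.password": "<read from your secret store>",
    "ssl.ca.location": "/etc/ssl/certs/ca.pem",  # placeholder CA path
})
# Reuse this producer across requests: every new connection repeats the TLS handshake
# and SASL exchange, which is where most of the connection-setup latency comes from.
```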

ACL evaluation

  • Complex ACL rules increase authorization overhead per request.
  • Can become a bottleneck with fine-grained permissions.

8. Monitoring and Observability Gaps

Missing or inadequate metrics

  • Can't optimize what you can't measure.
  • Key metrics: consumer lag, throughput, latency percentiles (p50, p95, p99), error rates.
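
Consumer lag can be computed directly from watermark offsets. The sketch below assumes the confluent-kafka client and is meant for ad hoc checks, not as a replacement for a proper lag exporter.

```python
from confluent_kafka import Consumer

def assigned_lag(consumer: Consumer) -> dict:
    """Return lag (log end offset minus committed offset) for each assigned partition."""
    lag = {}
    for tp in consumer.assignment():
        low, high = consumer.get_watermark_offsets(tp, timeout=5.0)
        committed = consumer.committed([tp], timeout=5.0)[0]
        current = committed.offset if committed.offset >= 0 else low  # no commit yet
        lag[(tp.topic, tp.partition)] = high - current
    return lag
```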

Alert fatigue

  • Too many alerts or poorly tuned thresholds mask real issues.
  • Missing alerts for critical conditions (high consumer lag, broker resource exhaustion).

Lack of distributed tracing

  • Hard to correlate message flow across services.
  • Difficult to identify bottlenecks in multi-hop message flows.

9. Testing and Load Testing Challenges

Insufficient load testing

  • Many performance issues only appear under realistic production load.
  • Testing in lower environments with reduced load may miss bottlenecks.
  • Solution: Use tools like Speedscale to replay production traffic patterns in test environments.

Environment differences

  • Configuration drift between environments masks issues.
  • Test brokers running on different hardware than production don't reflect production behavior.
  • Network topology differences affect latency and throughput.

Unrealistic test data

  • Synthetic data may not match production message sizes or complexity.
  • Missing edge cases (large messages, malformed messages, bursty traffic).

Load Testing with Speedscale

Speedscale helps overcome these challenges by capturing real production traffic patterns and replaying them in test environments. This ensures your load tests reflect actual usage patterns, message sizes, and timing.

Learn how to load test message brokers with Speedscale

10. Operational Issues

Dead letter queue handling

  • Messages that repeatedly fail processing accumulate in DLQs.
  • Unmonitored DLQs can hide data loss or processing failures.
  • Symptoms: Silent data loss, missing messages downstream.

Poison messages

  • A single malformed or problematic message can stall an entire queue or partition.
  • Consumers may crash or hang when processing poison messages.
  • Solution: Implement proper error handling, message validation, and DLQ strategies.
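
A minimal sketch of the dead-letter pattern for Kafka is shown below (confluent-kafka assumed); the topic names and single-attempt policy are illustrative, and real handlers usually add bounded retries before diverting a message.

```python
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "order-processors",
    "enable.auto.commit": False,
})
producer = Producer({"bootstrap.servers": "broker:9092"})
consumer.subscribe(["orders"])

def handle_order(order):
    pass  # placeholder for business logic

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        handle_order(json.loads(msg.value()))   # validation step: malformed JSON raises here
    except Exception:
        # Divert the bad message to an illustrative DLQ topic instead of blocking the partition.
        producer.produce("orders.dlq", key=msg.key(), value=msg.value())
        producer.flush()
    consumer.commit(message=msg)                # never re-poll the same poison message
```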

Connection pool exhaustion

  • Limited connection pools cause contention under load.
  • Connection leaks gradually exhaust available connections.
  • Symptoms: Connection timeouts, blocked threads, cascading failures.

Clock skew and time synchronization

  • Timestamp inconsistencies affect message ordering and retention.
  • Can cause messages to be dropped or processed out of order.
  • Solution: Use NTP or similar time synchronization across all brokers and clients.

11. Broker-Specific Common Issues

Kafka

  • Leader imbalance: Uneven distribution of partition leaders causes hot brokers.
  • ISR shrink: In-sync replica set shrinking indicates replication lag or broker issues.
  • Network/request thread saturation: Limited handler threads bottleneck request processing.
  • Page cache misses: Reading cold data from disk instead of cache hurts performance.
  • Log compaction pressure: Compaction falling behind increases storage and read overhead.

RabbitMQ

  • Memory and disk alarms: Triggering flow control blocks publishers, causing cascading backpressure.
  • Exchange and binding explosions: Too many bindings increase routing overhead.
  • Queue type tradeoffs: Classic vs quorum vs stream queues have different performance characteristics.
  • Too many queues or consumers per node: Resource overhead per queue/consumer limits scalability.

Identifying Issues During Load Testing

When load testing with Speedscale, watch for these indicators:

  1. Latency increases: p95/p99 latency spikes under load indicate bottlenecks
  2. Consumer lag: Growing lag suggests processing can't keep up with production rate
  3. Error rate increases: Timeouts, connection failures, or processing errors
  4. Resource exhaustion: CPU, memory, disk I/O, or network saturation
  5. Throughput plateaus: Inability to scale beyond certain message rates

Use Speedscale's captured traffic to create realistic load scenarios that expose these issues before they impact production.

Recommendations

  1. Start with production traffic patterns: Use Speedscale to capture and replay real traffic
  2. Monitor comprehensively: Track broker, client, and application metrics
  3. Test at scale: Don't assume lower-environment results will hold in production
  4. Tune iteratively: Adjust broker, client, and application configs based on load test results
  5. Plan for failure: Test broker failures, network partitions, and client crashes
  6. Document baselines: Establish performance baselines to detect degradation