Introduction

Part 1 compared the architectural foundations of Kafka and Pulsar — storage models, subscription types, multi-tenancy, and client design. That was theory. This post puts those differences under load.

We will run the same workloads against both brokers on identical hardware, using a reactive benchmark framework built with Project Reactor. The goal is not to crown a winner — it is to see where architectural decisions become measurable. Specifically, we will look at:

  • Producer throughput — how each broker handles parallel producers and how well throughput scales with concurrency.
  • Consumer scaling — what happens when you add consumers beyond the partition count on a single-partition topic.

All benchmarks, Docker configurations, and results are available in the kafka-vs-pulsar repository.


Test Environment

Both brokers run as Docker containers on the same host with identical resource constraints. This is a deliberate choice — we are benchmarking client-broker interaction patterns, not production-scale cluster performance.

    graph LR
    subgraph "Test Client (macOS)"
        TC["Benchmark Runner<br/><i>Project Reactor</i><br/><i>Java 25</i>"]
    end

    subgraph "Docker Host (ali-alienware)"
        HW["Intel i9-14900HX<br/><i>24 cores / 32 threads</i><br/><i>32 GB DDR5</i>"]
        K["Kafka 3.7.1<br/><i>KRaft mode</i><br/><i>8 GB / 4 CPU</i>"]
        P["Pulsar 3.3.4<br/><i>Standalone</i><br/><i>8 GB / 4 CPU</i>"]
    end

    TC -- "LAN" --> K
    TC -- "LAN" --> P
    

Host Hardware

Parameter    Value
CPU          Intel Core i9-14900HX (24 cores / 32 threads)
L3 Cache     36 MB
RAM          32 GB DDR5
OS           Ubuntu Linux (kernel 6.x)

Both brokers run as Docker containers on this host, each pinned to 4 CPU cores and 8 GB of memory. The host has headroom to spare — the containers are not competing for resources.

Container Configuration

Parameter           Kafka                                          Pulsar
Image               apache/kafka:3.7.1                             apachepulsar/pulsar:3.3.4
Mode                KRaft (no ZooKeeper)                           Standalone (embedded BookKeeper + ZK)
Memory limit        8 GB                                           8 GB
CPU limit           4 cores                                        4 cores
JVM heap            -Xmx6g -Xms2g                                  -Xmx6g -Xms6g
GC                  G1GC, 20ms pause target                        G1GC, 20ms pause target
IO threads          8                                              8
Network threads     8                                              8 (ordered executor)
Replication factor  1                                              ensemble=1, write/ack quorum=1
Flush               Every message (log.flush.interval.messages=1)  Journal sync enabled

The Docker Compose configuration explicitly matches durability guarantees — both brokers flush every message to disk, both use G1GC with the same pause target, both get the same thread counts. This is not how you would configure either system for production, but it ensures the comparison is fair.

# Kafka — key settings
KAFKA_LOG_FLUSH_INTERVAL_MESSAGES: 1
KAFKA_NUM_IO_THREADS: 8
KAFKA_NUM_NETWORK_THREADS: 8
KAFKA_HEAP_OPTS: "-Xmx6g -Xms2g -XX:+UseG1GC -XX:MaxGCPauseMillis=20"

# Pulsar — matched settings
PULSAR_PREFIX_journalSyncData: "true"
PULSAR_PREFIX_numIOThreads: "8"
PULSAR_PREFIX_numOrderedExecutorThreads: "8"
PULSAR_MEM: "-Xmx6g -Xms6g -XX:MaxDirectMemorySize=512m"
PULSAR_GC: "-XX:+UseG1GC -XX:MaxGCPauseMillis=20"

Benchmark Parameters

Parameter       Value
Total messages  1,000
Message size    1,024 bytes
Partitions      1
Thread steps    1, 6, 11, 16
Unique keys     100

One partition is intentional. It isolates the scaling behavior we want to measure — particularly the consumer scaling ceiling that Kafka’s partition model imposes.
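
These parameters map naturally onto a config object. A minimal stdlib-only sketch — the accessor names mirror the config.get…() calls that appear in the runner code later, but this is illustrative, not the repository's actual BenchmarkConfig:

```java
import java.time.Duration;
import java.util.List;

// Benchmark parameters from the table above, expressed as a config object.
// Accessor names follow the config.get…() calls used by the runner; the
// test timeout is an assumed value, not taken from the repository.
public class BenchmarkConfig {
    public int getTotalMessages()          { return 1_000; }
    public int getMessageSizeBytes()       { return 1_024; }
    public int getPartitions()             { return 1; }
    public List<Integer> getThreadSteps()  { return List.of(1, 6, 11, 16); }
    public int getUniqueKeys()             { return 100; }
    public Duration getTestTimeout()       { return Duration.ofMinutes(5); } // assumed
}
```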


The Benchmark Framework

Before diving into results, it is worth understanding how the benchmarks are structured. The framework is built on Project Reactor with a shared core and broker-specific implementations.

    graph LR
    subgraph kafka[" kafka-benchmark "]
        KP["ReactiveKafka<br/>MessageProducer"]
        KC["ReactiveKafka<br/>MessageConsumer"]
        KT["KafkaProducer<br/>ThroughputBenchmark"]
        KS["KafkaConsumer<br/>ScalingBenchmark"]
    end

    subgraph core[" core "]
        PI["ReactiveMessage<br/>Producer"]
        CI["ReactiveMessage<br/>Consumer"]
        BR["Benchmark<br/>Runner"]
        BC["Benchmark<br/>Config"]
    end

    subgraph pulsar[" pulsar-benchmark "]
        PP["ReactivePulsar<br/>MessageProducer"]
        PC["ReactivePulsar<br/>MessageConsumer"]
        PT["PulsarProducer<br/>ThroughputBenchmark"]
        PS["PulsarConsumer<br/>ScalingBenchmark"]
    end

    KP --> PI
    KC --> CI
    KT --> BR
    KS --> BR
    PP --> PI
    PC --> CI
    PT --> BR
    PS --> BR
    

Reactive Abstractions

Both producers and consumers implement a common reactive interface:

public interface ReactiveMessageProducer extends AutoCloseable {
    Mono<Void> send(BenchmarkMessage message);
    Flux<Void> sendMany(Flux<BenchmarkMessage> messages);
    Mono<Void> createTopic(String topic, int partitions);
    void close();
}

public interface ReactiveMessageConsumer extends AutoCloseable {
    Flux<BenchmarkMessage> consume();
    void close();
}

The Kafka implementation wraps reactor-kafka:

public class ReactiveKafkaMessageProducer implements ReactiveMessageProducer {

    private final KafkaSender<String, byte[]> sender;

    @Override
    public Mono<Void> send(BenchmarkMessage message) {
        ProducerRecord<String, byte[]> record =
            new ProducerRecord<>(topic, message.getKey(), message.serialize());
        SenderRecord<String, byte[], String> senderRecord =
            SenderRecord.create(record, message.getId());

        return sender.send(Mono.just(senderRecord)).then();
    }
}

The Pulsar implementation wraps reactive-pulsar:

public class ReactivePulsarMessageProducer implements ReactiveMessageProducer {

    private final ReactiveMessageSender<byte[]> sender;

    @Override
    public Mono<Void> send(BenchmarkMessage message) {
        return sender.sendOne(MessageSpec.builder(message.serialize())
                                         .key(message.getKey())
                                         .build())
                     .then();
    }
}

Same Mono<Void> return type, same reactive backpressure semantics, different underlying client libraries. The BenchmarkRunner orchestrates execution using these abstractions, so the benchmark logic is identical for both brokers.

How the Runner Works

The producer benchmark creates N parallel producers and distributes messages evenly across them, using flatMap with a concurrency limit of 256 to bound the number of in-flight sends per producer:

Flux.fromIterable(producers)
    .flatMap(producer -> Flux.range(0, messagesPerProducer)
        .map(i -> BenchmarkMessage.create("key-" + (i % 100), config.getMessageSizeBytes())
                                   .withProducedAtNanos(System.nanoTime()))
        .flatMap(msg -> {
            long sendStart = System.nanoTime();
            return producer.send(msg)
                           .doOnSuccess(v -> {
                               metrics.recordMessageSent();
                               metrics.recordProduceLatency(System.nanoTime() - sendStart);
                               sentCount.incrementAndGet();
                           });
        }, 256),
    config.getParallelProducers())
    .blockLast(config.getTestTimeout());
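
The flatMap(…, 256) concurrency argument is what caps unacknowledged sends. The same mechanism can be illustrated without Reactor using a plain semaphore — a stdlib-only analogy, not code from the framework:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class InFlightCap {
    // At most `cap` sends are in flight at once: acquire a permit before each
    // send, release it when the send completes. flatMap(…, cap) gives the
    // same bound declaratively.
    static int maxInFlightObserved(int totalSends, int cap) {
        Semaphore permits = new Semaphore(cap);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger maxSeen = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(32);
        try {
            for (int i = 0; i < totalSends; i++) {
                permits.acquire();                  // blocks while `cap` sends are pending
                pool.submit(() -> {
                    maxSeen.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
                    inFlight.decrementAndGet();
                    permits.release();              // frees a slot for the next send
                });
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return maxSeen.get();
    }

    public static void main(String[] args) {
        // Observed concurrency never exceeds the cap of 256.
        System.out.println(maxInFlightObserved(1_000, 256));
    }
}
```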

The consumer scaling benchmark pre-produces all messages, then starts N parallel consumers on boundedElastic schedulers and waits for all messages to be consumed via a CountDownLatch:

List<ReactiveMessageConsumer> consumers = IntStream.range(0, config.getParallelConsumers())
                                                   .mapToObj(i -> consumerFactory.get())
                                                   .toList();

Disposable.Composite disposable = Disposables.composite();

consumers.forEach(consumer -> {
    Disposable d = consumer.consume()
                           .subscribeOn(Schedulers.boundedElastic())
                           .subscribe(msg -> {
                               metrics.recordMessageReceived();
                               metrics.recordE2eLatency(System.nanoTime() - msg.getProducedAtNanos());
                               receivedCount.incrementAndGet();
                               latch.countDown();
                           });
    disposable.add(d);
});

Producer Throughput

The first benchmark measures raw producer throughput: how fast can each broker accept messages from an increasing number of parallel producers?

Results

Producers    Kafka (msg/sec)    Kafka (MB/sec)    Pulsar (msg/sec)    Pulsar (MB/sec)
1            4,239              4.14              3,997               3.90
6            5,444              5.32              7,970               7.78
11           4,655              4.55              10,203              9.96
16           4,405              4.30              8,803               8.60

    graph LR
    subgraph "Producer Throughput (msg/sec)"
        direction TB
        K1["Kafka × 1<br/><b>4,239</b>"]
        K6["Kafka × 6<br/><b>5,444</b> (peak)"]
        K16["Kafka × 16<br/><b>4,405</b>"]
        P1["Pulsar × 1<br/><b>3,997</b>"]
        P11["Pulsar × 11<br/><b>10,203</b> (peak)"]
        P16["Pulsar × 16<br/><b>8,803</b>"]
    end

    K1 -. "1.28× peak scaling" .-> K6
    P1 -. "2.55× peak scaling" .-> P11
    

Analysis

With a single producer, both brokers perform in the same range — Kafka is marginally faster (4,239 vs 3,997 msg/sec). The more interesting story is how each broker responds to increasing concurrency.

Kafka peaks at 6 producers (5,444 msg/sec) — a 1.28x improvement over the single-producer baseline. Beyond that, throughput actually drops back toward the single-producer level. All producers are writing to the same single partition, and Kafka serializes writes at the partition leader. At low concurrency, more producers provide more batching opportunities. Past the inflection point, the contention on the single partition’s write path starts to dominate.

Pulsar peaks at 11 producers (10,203 msg/sec) — a 2.55x improvement. Pulsar continues to scale well past the point where Kafka plateaus. At 16 producers, Pulsar still delivers 8,803 msg/sec — roughly 2x Kafka’s throughput at the same concurrency level. This comes from Pulsar’s architecture: the broker does not own the storage. It dispatches writes to BookKeeper, which distributes segments across bookies. Even with a single topic, Pulsar’s write path can absorb more concurrent producers because the broker acts as a routing layer, not a storage engine.

The scaling curves tell the story. Kafka’s throughput curve is essentially flat — it peaks early and cannot absorb additional producer concurrency. Pulsar’s curve rises further before bending, reflecting its ability to pipeline writes through the separated compute-storage architecture discussed in Part 1.
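
The peak scaling factors quoted above are simply peak throughput divided by the single-producer baseline — a quick check against the results table:

```java
public class ScalingFactor {
    // Peak scaling factor = peak throughput / single-producer baseline,
    // rounded to two decimals. Inputs are msg/sec values from the table.
    static double scalingFactor(double peak, double baseline) {
        return Math.round(peak / baseline * 100.0) / 100.0;
    }

    public static void main(String[] args) {
        System.out.println(scalingFactor(5_444, 4_239));   // Kafka:  1.28 (peak at 6 producers)
        System.out.println(scalingFactor(10_203, 3_997));  // Pulsar: 2.55 (peak at 11 producers)
    }
}
```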


Consumer Scaling

The consumer scaling benchmark exposes one of Kafka’s most significant architectural constraints: the number of active consumers is bounded by the number of partitions.

Results

Consumers    Kafka (msg/sec)    Kafka Duration    Pulsar (msg/sec)    Pulsar Duration
1            297                3.37s             8,657               0.12s
6            159                6.31s             14,304              0.07s
11           158                6.33s             11,817              0.08s
16           160                6.24s             16,668              0.06s

All 1,000 messages were received in every configuration. For Kafka, only 1 of N consumers was active due to the single partition — the rest sat idle after rebalancing. For Pulsar, all N consumers were active via Shared subscription. The difference is dramatic — not in delivery, but in speed. Pulsar consumed all messages in 60–120 milliseconds. Kafka took 3–6 seconds.

    graph LR
    subgraph "Consumer Scaling — Throughput (msg/sec)"
        direction TB
        K1["Kafka × 1<br/><b>297 msg/sec</b><br/><i>3.37s</i>"]
        K16["Kafka × 16<br/><b>160 msg/sec</b><br/><i>6.24s</i>"]
        P1["Pulsar × 1<br/><b>8,657 msg/sec</b><br/><i>0.12s</i>"]
        P16["Pulsar × 16<br/><b>16,668 msg/sec</b><br/><i>0.06s</i>"]
    end

    K1 -. "adding consumers<br/>makes it slower" .-> K16
    P1 -. "1.93× scaling<br/>with Shared sub" .-> P16
    

With 16 consumers, Pulsar achieves a 1.93x throughput improvement over a single consumer — all 16 participate in message delivery via round-robin. Kafka’s throughput does not improve — it actually drops by nearly half. Adding consumers to a 1-partition topic makes Kafka slower because the consumer group rebalancing protocol must coordinate all 16 consumers, only to assign the single partition to one of them.

The Kafka Ceiling

In Kafka, each partition in a topic is assigned to exactly one consumer in a consumer group. If you have 1 partition and 16 consumers, only 1 consumer will receive messages — the other 15 sit idle after the rebalance completes. The benchmark makes this explicit:

// From KafkaConsumerScalingBenchmark
int effectiveConsumers = Math.min(threads, partitions);

log.info("Kafka consumer scaling: requested={}, effective={} (partitions={})",
        threads, effectiveConsumers, partitions);

The actual test output confirms the partition assignment. With 16 consumers on 1 partition, only one consumer gets assigned:

Consumer [group=scaling-group-092d7fdc] assigned 0 partitions for scaling-16-2b5634a4
Consumer [group=scaling-group-092d7fdc] assigned 0 partitions for scaling-16-2b5634a4
Consumer [group=scaling-group-092d7fdc] assigned 1 partitions for scaling-16-2b5634a4  // only this one
Consumer [group=scaling-group-092d7fdc] assigned 0 partitions for scaling-16-2b5634a4
... (12 more with 0 partitions)

The rebalancing itself is expensive — the benchmark shows ~6 seconds of total duration at 6+ consumers, compared to 3.4 seconds with a single consumer. The consumer group protocol coordinates all members, even when only one can receive messages. The extra consumers are pure overhead.

    graph TD
    subgraph "Kafka — 1 partition, 16 consumers"
        T_K["Topic<br/><i>1 partition</i>"]
        T_K --> C1_K["Consumer 1<br/><i>ACTIVE — 1 partition</i>"]
        T_K -. "idle — 0 partitions" .-> C2_K["Consumer 2"]
        T_K -. "idle — 0 partitions" .-> C3_K["Consumer 3"]
        T_K -. "idle — 0 partitions" .-> C_REST_K["Consumers 4–16<br/><i>all idle</i>"]
    end
    

The only way to increase consumer parallelism in Kafka is to increase the partition count — a heavy operational change that means either expanding the existing topic (which remaps keys and can affect per-key ordering) or creating a new topic and migrating producers and consumers to it.
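
Part of the ordering risk is mechanical: the producer picks a partition by hashing the key modulo the partition count, so changing the count remaps keys. A stdlib illustration of the effect — Kafka's default partitioner actually uses murmur2 on the serialized key; String.hashCode stands in here purely to show the remapping:

```java
public class KeyPartitioning {
    // Illustrative key -> partition mapping: hash modulo partition count.
    // Kafka's default partitioner uses murmur2 on the serialized key;
    // String.hashCode is only a stand-in for the modulo effect.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        // With 1 partition, every key maps to partition 0. After raising the
        // count, the same key can map elsewhere — so messages for one key
        // written before and after the change may live on different partitions.
        System.out.println(partitionFor("key-42", 1)); // always 0
        System.out.println(partitionFor("key-42", 4)); // some value in 0..3
    }
}
```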

The Pulsar Approach

Pulsar decouples consumer scaling from the partition count through its Shared subscription model. With a Shared subscription, the broker distributes messages round-robin across all connected consumers, regardless of how many partitions the topic has.

// From PulsarConsumerScalingBenchmark
BenchmarkResult result = createRunner("consumer-scaling").runConsumerScalingBenchmark(
        "consumer-scaling-t" + threads,
        config,
        () -> new ReactivePulsarMessageProducer(pulsarConfig, topic),
        () -> new ReactivePulsarMessageConsumer(pulsarConfig, topic, subscription,
                                                SubscriptionType.Shared)
);

log.info("Pulsar consumer scaling: threads={} (ALL active)", threads);

With 1 partition and 16 consumers, all 16 consumers receive messages. No idle consumers. No rebalancing overhead. No repartitioning required.
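
The round-robin arithmetic behind "all 16 active" is straightforward — a stdlib sketch (the broker's real dispatcher also honors per-consumer flow-control permits; this shows only the even spread):

```java
public class RoundRobin {
    // Round-robin dispatch: message i goes to consumer (i % consumerCount).
    static int[] roundRobinCounts(int totalMessages, int consumers) {
        int[] counts = new int[consumers];
        for (int i = 0; i < totalMessages; i++) {
            counts[i % consumers]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // 1,000 messages over 16 consumers: the first 8 get 63, the rest 62.
        int[] counts = roundRobinCounts(1_000, 16);
        System.out.println(counts[0] + " / " + counts[15]); // 63 / 62
    }
}
```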

    graph TD
    subgraph "Pulsar Shared — 1 partition, 16 consumers"
        T_P["Topic<br/><i>1 partition</i>"]
        T_P --> C1_P["Consumer 1<br/><i>ACTIVE</i>"]
        T_P --> C2_P["Consumer 2<br/><i>ACTIVE</i>"]
        T_P --> C3_P["Consumer 3<br/><i>ACTIVE</i>"]
        T_P --> C_REST_P["Consumers 4–16<br/><i>ALL ACTIVE</i>"]
    end
    

Why This Matters

The benchmark results make the architectural difference concrete. Even with a single consumer, Pulsar’s throughput is 29x higher than Kafka’s (8,657 vs 297 msg/sec). At 16 consumers, the gap widens to over 100x (16,668 vs 160 msg/sec) — and Kafka’s throughput actually drops as you add consumers.

The per-consumer throughput gap is partly explained by the client consumption models. Pulsar’s reactive consumer uses consumeMany, which processes messages in a streaming pipeline with acknowledgments, while Kafka’s reactive consumer uses a poll-based KafkaReceiver that seeks to the beginning on partition assignment. The scaling behavior, however, is entirely architectural — more consumers cannot help Kafka on a single partition, while Pulsar distributes work across all of them.

In practice, you often discover that you need more consumer parallelism after the system is already running with a fixed partition count. In Kafka, this means:

  1. Create a new topic with more partitions and migrate producers and consumers to it, or
  2. Increase the partition count on the existing topic — but Kafka does not redistribute existing data (only new messages use the new partitions), and existing key-to-partition assignments change.

In Pulsar, you just start more consumers. The partition count is irrelevant to consumer scaling when using a Shared subscription.


Observations and Takeaways

Producer throughput scales differently

With a single producer, both brokers are competitive. The separation of compute and storage in Pulsar pays off when you need concurrent producers — the 2.55x peak scaling factor vs Kafka’s 1.28x is significant. Kafka’s throughput peaks early (at 6 producers) and declines with further concurrency; Pulsar continues to scale well past that point, peaking at 11 producers. If your workload involves many services producing to the same topic, Pulsar’s architecture handles it more efficiently.

Consumer parallelism is a design-time decision in Kafka

In Kafka, you must decide your maximum consumer parallelism when you create the topic. The partition count is effectively a scaling ceiling for consumers. Our benchmark demonstrated this concretely — with 1 partition, adding consumers not only fails to improve throughput, it actively degrades it from 297 to 160 msg/sec due to consumer group rebalancing overhead.

In Pulsar, consumer parallelism is a runtime decision. You choose a subscription type (Shared, Key_Shared, Exclusive, Failover) and start as many consumers as you need. The topic does not need to know or care. With 16 consumers on a single partition, Pulsar delivered 1.93x the throughput of a single consumer.

The throughput gap is wider than the scaling gap

Even before we look at scaling behavior, Pulsar’s single-consumer throughput in the consumer benchmark was 29x higher than Kafka’s. This points to fundamental differences in the consumption path — Pulsar’s reactive consumer processes messages in a streaming pipeline, while Kafka’s consumer group protocol adds coordination overhead even for a single consumer. In a benchmark with a single partition, this gap dominates the results.

Caveats

These benchmarks run on a single-broker, single-machine setup with 1,000 messages per test. They are designed to isolate client-broker interaction patterns, not to represent production cluster performance. Specifically:

  • Replication factor 1 — production systems use 3. Replication affects both throughput and latency.
  • Single partition — intentional for consumer scaling tests, but real workloads use multiple partitions for parallelism in Kafka.
  • 1,000 messages — sufficient to expose scaling patterns, but production workloads involve millions. Larger message volumes would amortize startup and rebalancing costs differently.
  • No disk contention — both brokers have dedicated volumes. In shared infrastructure, IO patterns matter more.
  • Flush-every-message — matches durability semantics but penalizes both brokers compared to batched flush configurations.
  • LAN latency — the test client runs on macOS over LAN to the Docker host. Co-located clients would reduce network variance.

What Is Coming Next

Future parts will expand the benchmark suite to cover:

  • Key_Shared subscription — Pulsar’s unique per-key ordering with flexible consumer scaling, which has no Kafka equivalent.
  • End-to-end latency — produce-to-consume latency under varying load, including tail latency (p99, p99.9).
  • Multi-partition scaling — throughput curves as partition counts increase from 1 to 64+.
  • Backpressure behavior — how each client handles producer backpressure when the broker is saturated.

References

  1. Apache Kafka Documentation
  2. Apache Pulsar Documentation
  3. Reactor Kafka
  4. Reactive Pulsar Client
  5. Pulsar Subscription Types
  6. Kafka Consumer Group Protocol
  7. Benchmark Source Code