Introduction

Kafka has been the default choice for event streaming for over a decade. It earned that position — it solved real problems at LinkedIn-scale when nothing else could, and the ecosystem built around it is massive. If you are building a new system today and someone says “use Kafka,” that is not bad advice.

But it is incomplete advice.

Apache Pulsar was created at Yahoo in 2013 to solve problems that Kafka’s architecture made structurally difficult: multi-tenancy across thousands of topics, independent scaling of storage and compute, and geo-replication as a first-class feature rather than an afterthought. It was open-sourced in 2016 and became a top-level Apache project in 2018.

This is not a “Pulsar is better than Kafka” article. Both systems make deliberate trade-offs. The goal of Part 1 is to compare the architectural foundations — the design decisions that shape everything else. Part 2 will get into benchmarks, client behavior under load, and production tuning.


The Fundamental Architectural Difference

The single most important difference between Kafka and Pulsar is how they handle storage.

Kafka couples compute and storage. Each broker owns a set of partitions, and the data for those partitions lives on the broker’s local disk. The broker that serves reads and writes for a partition is the same machine that stores the data.

Pulsar separates compute and storage. Brokers handle the messaging protocol — producers, consumers, subscriptions, routing — but they do not store data. Storage is delegated to Apache BookKeeper, a distributed log storage system that runs as a separate cluster.

This is not a minor implementation detail. It is the architectural decision from which nearly every other difference follows.

  graph LR
    subgraph Kafka Architecture
        direction LR
        P1K["Producer"] --> B1K["Broker 1<br/><i>serves + stores</i><br/>Partition 0, 1"]
        P2K["Producer"] --> B2K["Broker 2<br/><i>serves + stores</i><br/>Partition 2, 3"]
        P3K["Producer"] --> B3K["Broker 3<br/><i>serves + stores</i><br/>Partition 4, 5"]
        B1K --> C1K["Consumer"]
        B2K --> C2K["Consumer"]
        B3K --> C3K["Consumer"]
        B1K -. "replication" .-> B2K
        B2K -. "replication" .-> B3K
    end

  graph LR
    subgraph Pulsar Architecture
        direction LR
        P1P["Producer"] --> BR1["Broker 1<br/><i>serves only</i>"]
        P2P["Producer"] --> BR2["Broker 2<br/><i>serves only</i>"]
        BR1 --> C1P["Consumer"]
        BR2 --> C2P["Consumer"]
        BR1 --> BK["BookKeeper Cluster<br/><i>stores only</i>"]
        BR2 --> BK
        BK --> BK1["Bookie 1"]
        BK --> BK2["Bookie 2"]
        BK --> BK3["Bookie 3"]
    end

What This Means in Practice

In Kafka, when a broker goes down, leadership for the partitions it led must move to other brokers. The controller elects new leaders from the in-sync replicas, and during that window the affected partitions may be unavailable for writes. The election itself is fast when in-sync replicas exist, but restoring the full replication factor means copying partition data to a new replica, which can take a long time for large partitions.

In Pulsar, when a broker goes down, another broker simply picks up ownership of the affected topics. There is no data to move — the data already lives in BookKeeper, replicated across multiple bookies. The new broker starts serving immediately by reading from BookKeeper. Recovery is near-instantaneous because it is a metadata operation, not a data operation.

This separation also means you can scale compute and storage independently:

| Scenario | Kafka | Pulsar |
| --- | --- | --- |
| Need more throughput? | Add brokers, then rebalance partitions onto them (moves data) | Add brokers (no data movement) |
| Need more storage? | Add brokers or expand disks (triggers rebalancing) | Add bookies (BookKeeper handles placement internally) |
| Broker failure recovery | Elect new leaders from replicas; re-replicate data to restore the replication factor | New broker takes ownership and reads from existing BookKeeper data |
| Storage failure recovery | Broker is degraded until the disk is replaced and re-replicated | Bookie replaced independently; BookKeeper re-replicates its segments |
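The broker-failure row can be sketched in a few lines of Java. In the Pulsar model, failover amounts to rewriting ownership metadata; no payload moves. The class, topic, and broker names below are illustrative, not Pulsar internals.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of Pulsar-style failover: topic ownership is a metadata
// entry, so reassigning a failed broker's topics is a map update.
public class OwnershipFailover {
    static Map<String, String> ownership = new HashMap<>();

    static void assign(String topic, String broker) {
        ownership.put(topic, broker);
    }

    // Reassign every topic owned by the failed broker. No data is copied,
    // because the data stays in BookKeeper.
    static int failover(String failedBroker, String newBroker) {
        int moved = 0;
        for (Map.Entry<String, String> e : ownership.entrySet()) {
            if (e.getValue().equals(failedBroker)) {
                e.setValue(newBroker);
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        assign("orders", "broker-1");
        assign("payments", "broker-1");
        assign("clicks", "broker-2");
        System.out.println("reassigned: " + failover("broker-1", "broker-3"));
        System.out.println("orders now owned by " + ownership.get("orders"));
    }
}
```

The equivalent step in Kafka would be the point where the new leader must be caught up, which is a data operation rather than a map update.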

Storage Model — Partitions vs Segments

Kafka and Pulsar model their storage differently, and this has significant consequences for how data moves through the system.

Kafka — Partition-Centric Storage

In Kafka, the partition is the unit of storage, parallelism, and ordering. Each partition is an append-only log stored as a sequence of segment files on a single broker’s disk. Replicas of that partition exist on other brokers, but the leader handles all reads and writes.

This design creates a tight coupling between a partition and its broker. If you need to move a partition — because a broker is overloaded, because you are decommissioning hardware, because the cluster is imbalanced — you must physically copy the data from one broker to another. For large partitions (hundreds of gigabytes or more), this can take hours and consumes significant network bandwidth.

Pulsar — Segment-Centric Storage

Pulsar takes a different approach. Each topic (or partition of a partitioned topic) is divided into segments (also called ledgers in BookKeeper terminology). As a segment fills up, the broker closes it and opens a new one. Each segment is independently replicated across multiple bookies.

  graph TD
    subgraph "Pulsar Topic Storage"
        T["Topic"] --> S1["Segment 1<br/><i>closed</i>"]
        T --> S2["Segment 2<br/><i>closed</i>"]
        T --> S3["Segment 3<br/><i>active</i>"]
    end

    subgraph "BookKeeper"
        S1 --> BK1["Bookie 1"]
        S1 --> BK2["Bookie 2"]
        S2 --> BK2
        S2 --> BK3["Bookie 3"]
        S3 --> BK1
        S3 --> BK3
    end

The key insight is that segments are distributed across bookies, not pinned to a specific one. Segment 1 might live on bookies 1 and 2, while segment 2 lives on bookies 2 and 3. This means:

  • No hot spots. Write load is distributed across the bookie cluster at the segment level, not the topic level.
  • No rebalancing on broker failure. When a Pulsar broker goes down, only topic ownership moves — the data stays in BookKeeper.
  • Efficient catch-up reads. A consumer reading historical data can read from any bookie that holds the relevant segment, parallelizing reads across the storage cluster.
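The placement idea can be illustrated with a toy round-robin policy. The `ensembleFor` helper below is hypothetical; BookKeeper's real placement policies also weigh racks and free disk space. The effect is the same, though: consecutive segments of one topic land on different bookie sets.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of segment-centric placement: each new segment picks its own
// ensemble of bookies, so a single topic's data spreads across the cluster.
public class SegmentPlacement {
    // Choose `ensembleSize` bookies for segment number `segmentId`,
    // rotating the starting bookie for each successive segment.
    static List<String> ensembleFor(int segmentId, List<String> bookies, int ensembleSize) {
        List<String> ensemble = new ArrayList<>();
        for (int i = 0; i < ensembleSize; i++) {
            ensemble.add(bookies.get((segmentId + i) % bookies.size()));
        }
        return ensemble;
    }

    public static void main(String[] args) {
        List<String> bookies = List.of("bookie-1", "bookie-2", "bookie-3");
        for (int seg = 0; seg < 4; seg++) {
            System.out.println("segment " + seg + " -> " + ensembleFor(seg, bookies, 2));
        }
    }
}
```

Contrast this with a partition-centric model, where every segment of a partition would be pinned to the same broker's disk.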

Multi-Tenancy

Kafka was designed as a single-tenant system. You can run multiple teams on the same Kafka cluster, and Kafka does offer topic-level ACLs and per-client quotas, but there is no namespace hierarchy and no tenant-level resource isolation. Large organizations typically end up running multiple Kafka clusters, one per team or environment, which multiplies operational overhead.

Pulsar was designed from the ground up as a multi-tenant system. Its resource hierarchy reflects this:

  graph LR
    I["Pulsar Instance"] --> T1["Tenant: team-payments"]
    I --> T2["Tenant: team-analytics"]

    T1 --> N1["Namespace: production"]
    T1 --> N2["Namespace: staging"]

    T2 --> N3["Namespace: production"]
    T2 --> N4["Namespace: experiments"]

    N1 --> TP1["Topic: transactions"]
    N1 --> TP2["Topic: refunds"]
    N3 --> TP3["Topic: events"]
    N4 --> TP4["Topic: model-training"]

Each level in the hierarchy supports independent configuration:

| Level | What You Can Configure |
| --- | --- |
| Tenant | Authentication, authorization, allowed clusters |
| Namespace | Retention policies, TTL, replication clusters, schema enforcement, rate limits, storage quotas, backlog quotas |
| Topic | Schema, compaction, partitions |

This means you can run a single Pulsar cluster for your entire organization, with each team getting isolated namespaces that have their own retention policies, rate limits, and storage quotas. The payments team can retain 30 days of data with strict schema enforcement, while the analytics team retains 7 days with relaxed schemas — all on the same infrastructure.
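The hierarchy is visible in Pulsar's topic naming scheme, persistent://tenant/namespace/topic. The minimal parser below is a sketch to show the structure; the real client library ships a richer TopicName class with validation and defaults.

```java
// Splits a fully qualified Pulsar topic name into its hierarchy levels.
public class TopicName {
    final String domain, tenant, namespace, topic;

    TopicName(String fullName) {
        String[] schemeSplit = fullName.split("://", 2);
        domain = schemeSplit[0];                       // "persistent" or "non-persistent"
        String[] parts = schemeSplit[1].split("/", 3); // tenant / namespace / topic
        tenant = parts[0];
        namespace = parts[1];
        topic = parts[2];
    }

    public static void main(String[] args) {
        TopicName t = new TopicName("persistent://team-payments/production/transactions");
        System.out.println(t.tenant + " / " + t.namespace + " / " + t.topic);
    }
}
```

Because the tenant and namespace are part of every topic name, policies applied at those levels unambiguously cover every topic beneath them.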


Subscription Models

This is where Pulsar’s flexibility becomes most apparent. Kafka has a single consumption model: consumer groups. Pulsar offers four distinct subscription types, each designed for a different use case.

Kafka — Consumer Groups

In Kafka, consumers form consumer groups. Each partition within a topic is assigned to exactly one consumer in the group. This provides:

  • Ordering guarantees within a partition.
  • Parallel processing across partitions.
  • Automatic rebalancing when consumers join or leave.

But it also imposes constraints. The number of active consumers in a group is bounded by the number of partitions. If you have 10 partitions and 15 consumers, 5 consumers sit idle. If you want two independent applications to read the same data, you need two consumer groups — each maintaining its own offset state.
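The idle-consumer constraint is easy to see in a toy assignment function. This is illustrative, not Kafka's actual assignor protocol.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy version of consumer-group assignment: each partition goes to exactly
// one consumer, so with more consumers than partitions some receive nothing.
public class GroupAssignment {
    static Map<String, List<Integer>> assign(int partitions, List<String> consumers) {
        Map<String, List<Integer>> out = new HashMap<>();
        for (String c : consumers) out.put(c, new ArrayList<>());
        for (int p = 0; p < partitions; p++) {
            out.get(consumers.get(p % consumers.size())).add(p);
        }
        return out;
    }

    // Consumers that were assigned no partitions at all.
    static long idleConsumers(int partitions, List<String> consumers) {
        return assign(partitions, consumers).values().stream()
                .filter(List::isEmpty).count();
    }

    public static void main(String[] args) {
        List<String> consumers = new ArrayList<>();
        for (int i = 0; i < 15; i++) consumers.add("consumer-" + i);
        // 10 partitions, 15 consumers: 5 consumers sit idle.
        System.out.println("idle: " + idleConsumers(10, consumers));
    }
}
```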

Pulsar — Four Subscription Types

Pulsar decouples the subscription model from the topic’s partition count:

  graph LR
    T["Topic"]

    subgraph "Exclusive"
        T --> E1["Consumer A<br/><i>receives all messages</i>"]
    end

    subgraph "Failover"
        T --> F1["Consumer A<br/><i>active</i>"]
        T -. "standby" .-> F2["Consumer B<br/><i>standby</i>"]
    end

    subgraph "Shared"
        T --> S1["Consumer A"]
        T --> S2["Consumer B"]
        T --> S3["Consumer C"]
    end

    subgraph "Key_Shared"
        T --> K1["Consumer A<br/><i>keys: a, c</i>"]
        T --> K2["Consumer B<br/><i>keys: b, d</i>"]
    end

| Subscription Type | Behavior | Ordering | Use Case |
| --- | --- | --- | --- |
| Exclusive | One consumer per subscription. A second consumer is rejected. | Total ordering | Event sourcing, audit logs |
| Failover | One active consumer, others on standby. Automatic failover on disconnect. | Total ordering (within the active consumer) | High-availability processing |
| Shared | Messages distributed round-robin across consumers. | None | High-throughput parallel processing |
| Key_Shared | Messages with the same key always routed to the same consumer. | Per-key ordering | Stateful processing, per-user workflows |

The Shared subscription is particularly powerful: you can scale consumers to any number, regardless of how many partitions the topic has. With Kafka, scaling consumers beyond the partition count means adding partitions, which is easy to trigger but disrupts the key-to-partition mapping and cannot be reversed. With Pulsar's shared subscription, you just start more consumers.

The Key_Shared subscription solves a common problem elegantly: processing events for the same entity (user, order, session) on the same consumer to maintain local state, without the partition-key coupling that Kafka requires.
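A simplified model of Key_Shared routing: hash each key into a slot and pin that slot to one consumer. Pulsar actually splits a fixed hash-range space across the connected consumers; the modulo version below is an illustrative approximation.

```java
import java.util.List;

// Sketch of the Key_Shared idea: every message with the same key is routed
// to the same consumer, giving per-key ordering without per-key partitions.
public class KeySharedRouting {
    static String consumerFor(String key, List<String> consumers) {
        int slot = Math.floorMod(key.hashCode(), consumers.size());
        return consumers.get(slot);
    }

    public static void main(String[] args) {
        List<String> consumers = List.of("consumer-A", "consumer-B");
        // The same key always lands on the same consumer.
        System.out.println("user-42 -> " + consumerFor("user-42", consumers));
        System.out.println("user-42 -> " + consumerFor("user-42", consumers));
    }
}
```

The difference from Kafka is where this mapping lives: the broker owns it and can rebalance hash ranges as consumers come and go, rather than baking the key into the partition layout at produce time.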


Client Architecture

The way Kafka and Pulsar clients interact with the cluster reflects their different philosophies.

Kafka Client

The Kafka producer and consumer are heavyweight clients. They maintain metadata about the entire cluster topology — which brokers own which partitions, the controller broker, ISR (in-sync replica) lists. The client is responsible for:

  • Discovering partition leaders.
  • Routing messages to the correct broker based on the partition key.
  • Handling leader elections and metadata refreshes.
  • Managing consumer group coordination through the group coordinator protocol.

This means the Kafka client carries significant logic. It needs to understand the cluster topology to function correctly. If the cluster is behind a load balancer, the client must still be able to reach individual brokers directly — Kafka’s protocol requires direct broker connections for producing and consuming.

Pulsar Client

The Pulsar client is lightweight by comparison. It connects to any broker (or a load balancer in front of brokers), and the broker handles routing internally. The client does not need to know which broker owns which topic — it discovers this through a lookup protocol:

  1. Client connects to any broker and asks: “Where is topic X?”
  2. The broker responds with the address of the broker currently serving that topic.
  3. The client connects to that broker for producing/consuming.

This means Pulsar works naturally behind load balancers and proxies — a significant operational advantage in cloud and Kubernetes environments.
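The three lookup steps above can be modeled as a metadata query that any broker can answer. Everything here, broker addresses and topic names included, is illustrative rather than the actual wire protocol.

```java
import java.util.Map;

// Toy model of Pulsar's lookup: the client asks any broker which broker
// owns a topic, then connects to that owner directly.
public class TopicLookup {
    // Shared ownership metadata, visible to every broker in the cluster.
    static final Map<String, String> owners = Map.of(
            "persistent://acme/prod/orders", "pulsar://broker-2:6650",
            "persistent://acme/prod/clicks", "pulsar://broker-1:6650");

    // Any broker can serve the lookup by consulting shared metadata,
    // which is why the first connection can go through a load balancer.
    static String lookup(String anyBroker, String topic) {
        return owners.get(topic);
    }

    public static void main(String[] args) {
        // Steps 1-2: ask an arbitrary broker, receive the owner's address.
        String owner = lookup("pulsar://broker-1:6650", "persistent://acme/prod/orders");
        // Step 3: the client would now connect to `owner` to produce/consume.
        System.out.println("owner: " + owner);
    }
}
```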

Async-First Design

The Pulsar client was built async-first: every producer and consumer operation has an asynchronous variant that returns a CompletableFuture, with the synchronous methods layered on top:

// Pulsar — async by default
CompletableFuture<MessageId> future = producer.sendAsync(message);

// Batching is built into the client
Producer<byte[]> producer = client.newProducer()
    .topic("my-topic")
    .batchingMaxMessages(1000)
    .batchingMaxPublishDelay(10, TimeUnit.MILLISECONDS)
    .sendTimeout(30, TimeUnit.SECONDS)
    .create();

Key client features that Pulsar provides out of the box:

  • Automatic batching. The client accumulates messages and sends them in batches, reducing network round trips. Batch size and delay are configurable.
  • Connection pooling. A single PulsarClient instance manages connections to all brokers, with configurable pool sizes.
  • Automatic reconnection. If a broker goes down, the client automatically reconnects and re-subscribes.
  • Dead letter topics. Failed messages are automatically routed to a dead letter topic after a configurable number of retries — no application code needed.
  • Negative acknowledgment. Consumers can nack a message to trigger redelivery after a configurable delay, enabling fine-grained retry logic.

// Dead letter policy — built into the consumer
Consumer<byte[]> consumer = client.newConsumer()
    .topic("my-topic")
    .subscriptionName("my-sub")
    .subscriptionType(SubscriptionType.Shared)
    .deadLetterPolicy(DeadLetterPolicy.builder()
        .maxRedeliverCount(3)
        .deadLetterTopic("my-topic-dlq")
        .build())
    .negativeAckRedeliveryDelay(10, TimeUnit.SECONDS)
    .subscribe();

Compare this with Kafka, where batching is configured at the producer level but dead letter handling, retry policies, and negative acknowledgment must be implemented in application code or through additional frameworks (like Spring Kafka’s error handlers).


Schema Registry and Evolution

Both Kafka and Pulsar support schema management, but with different approaches.

Kafka has no built-in schema registry; most deployments rely on an external one, most commonly the Confluent Schema Registry, a separate service that stores Avro, Protobuf, and JSON schemas. Producers and consumers communicate with the registry independently, and schema compatibility is enforced at the registry level. This works well but adds another service to deploy, monitor, and maintain.

Pulsar has a built-in schema registry that is part of the broker. Schemas are stored in BookKeeper alongside the data, and the broker enforces schema compatibility at produce time — a message that violates the schema is rejected before it ever hits storage.

// Pulsar — schema is part of the producer/consumer contract
Producer<User> producer = client.newProducer(Schema.JSON(User.class))
    .topic("users")
    .create();

Consumer<User> consumer = client.newConsumer(Schema.JSON(User.class))
    .topic("users")
    .subscriptionName("user-processor")
    .subscribe();

Pulsar supports multiple schema types (JSON, Avro, Protobuf, raw bytes) and schema compatibility strategies (BACKWARD, FORWARD, FULL, NONE) — all configured at the namespace level without deploying additional infrastructure.


Geo-Replication

Geo-replication is where the architectural differences between Kafka and Pulsar become most stark.

Kafka

Kafka does not have built-in geo-replication. The community solutions are:

  • MirrorMaker 2 — a Kafka Connect-based tool that replicates data between clusters. It works, but it is a separate system to deploy, monitor, and tune. It replicates at the topic level, and handling failover between clusters is an application-level concern.
  • Confluent Replicator — a commercial alternative with better operational tooling, but it is proprietary and adds licensing cost.

In both cases, geo-replication is bolted on rather than built in. You are running two (or more) independent Kafka clusters with a replication bridge between them.

Pulsar

Pulsar supports geo-replication as a first-class feature. You configure it at the namespace level:

# Enable geo-replication for a namespace
bin/pulsar-admin namespaces set-clusters \
    my-tenant/my-namespace \
    --clusters us-east,eu-west,ap-southeast

Once configured, every message published to a topic in that namespace is automatically replicated to all configured clusters. The replication is:

  • Asynchronous. Producers are not blocked waiting for cross-datacenter replication.
  • Bidirectional. Any cluster can accept writes, and messages are replicated to all other clusters.
  • Deduplication-aware. Pulsar’s message deduplication prevents duplicate messages when the same data arrives from multiple replication paths.

This is possible because of the separated storage architecture. Brokers in each cluster write to their local BookKeeper cluster, and the replication mechanism copies data between BookKeeper clusters asynchronously.
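The deduplication piece can be sketched with per-producer sequence IDs, which is the idea behind Pulsar's broker-side dedup. The class below is a simplified model, not Pulsar code.

```java
import java.util.HashMap;
import java.util.Map;

// Sequence-based deduplication sketch: each producer stamps messages with
// increasing sequence IDs, and the receiving side drops anything at or
// below the highest ID it has already seen from that producer. This is
// what suppresses duplicates arriving over multiple replication paths.
public class Dedup {
    static final Map<String, Long> lastSeq = new HashMap<>();

    // Returns true if the message is new and should be persisted.
    static boolean accept(String producerName, long sequenceId) {
        long seen = lastSeq.getOrDefault(producerName, -1L);
        if (sequenceId <= seen) return false; // duplicate from another path
        lastSeq.put(producerName, sequenceId);
        return true;
    }

    public static void main(String[] args) {
        System.out.println(accept("producer-us-east", 0)); // true
        System.out.println(accept("producer-us-east", 1)); // true
        System.out.println(accept("producer-us-east", 1)); // false: duplicate
    }
}
```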


When to Choose What

Neither system is universally better. The right choice depends on your constraints:

Choose Kafka when:

  • You have an existing Kafka ecosystem with established tooling and team expertise.
  • Your workload is primarily high-throughput event streaming with simple consumer group semantics.
  • You need the mature ecosystem of Kafka Connect, Kafka Streams, and ksqlDB.
  • You are running a single-datacenter deployment without multi-tenancy requirements.

Choose Pulsar when:

  • You need multi-tenancy with namespace-level isolation on shared infrastructure.
  • Your architecture requires independent scaling of compute and storage.
  • You need built-in geo-replication across multiple datacenters.
  • Your consumption patterns require flexible subscription models (shared, key-shared) beyond consumer groups.
  • You are building on Kubernetes and want a system that works naturally behind load balancers.
  • You want async-first client APIs with built-in dead letter handling and negative acknowledgment.

What Is Coming in Part 2

Part 2 will move from architecture to measurement. We will cover:

  • Latency benchmarks — publish latency, end-to-end latency, and tail latency under varying load.
  • Throughput benchmarks — maximum sustainable throughput with different message sizes, batching configurations, and replication factors.
  • Client behavior under pressure — how each client handles backpressure, broker failures, and network partitions.
  • Resource consumption — CPU, memory, and disk I/O profiles for comparable workloads.

All benchmarks will use comparable hardware, configurations, and workload profiles to ensure a fair comparison.

