Consumer Group Rebalancing

How Kafka redistributes partitions among consumers when members join, leave, or fail.

Read this as: What pauses, moves, and duplicates when group membership changes?
Failure Trap
Ignoring partition revocation and offset commits, then seeing duplicate or stalled processing.
Decision Rule
Handle revoke, commit, assign, and resume deliberately; use cooperative rebalancing when pause time matters.
Figure: Kafka consumer group rebalancing. Six topic partitions are shared by a consumer group. A consumer crashes, the coordinator triggers a stop-the-world rebalance, the two surviving consumers take over the orphaned partitions, processing resumes, and finally a new consumer joins to scale the group back to two partitions each. The example uses topic "events" (partitions P0 to P5) and group "order-processors".

Balanced Consumer Group

A consumer group is processing a topic with 6 partitions. Three consumers share the load, each handling 2 partitions. The load is spread evenly, and parallelism could still grow to 6 consumers, one per partition.

Each consumer regularly sends heartbeats to the group coordinator.

  • One partition → exactly one consumer (within group)
  • Even distribution maximizes throughput
  • Heartbeats prove liveness
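
The even split described above can be sketched in a few lines. This is a simplified, range-style assignment written for illustration; the function name assign_range is hypothetical and this is not Kafka's own assignor code.

```python
def assign_range(partitions, consumers):
    """Give each consumer a contiguous chunk; earlier consumers absorb any remainder."""
    members = sorted(consumers)
    base, extra = divmod(len(partitions), len(members))
    assignment, start = {}, 0
    for i, member in enumerate(members):
        size = base + (1 if i < extra else 0)
        assignment[member] = partitions[start:start + size]
        start += size
    return assignment

partitions = [f"P{i}" for i in range(6)]
assignment = assign_range(partitions, ["C1", "C2", "C3"])
print(assignment)
# → {'C1': ['P0', 'P1'], 'C2': ['P2', 'P3'], 'C3': ['P4', 'P5']}
```

Every partition appears under exactly one consumer, which is the "one partition, one consumer" rule within a group.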

Consumer C2 Crashes!

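C2 stops sending heartbeats. After the session timeout (10 seconds in this example; session.timeout.ms defaulted to 10 seconds before Kafka 3.0 and defaults to 45 seconds since), the coordinator considers C2 dead. Its partitions (P2, P3) are now orphaned.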

Messages accumulate on these partitions — no one is consuming them.

  • Session timeout detects failures
  • Orphaned partitions stop processing
  • Lag increases during failure
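
As a rough model of how a session timeout detects a dead member, the sketch below compares each member's last heartbeat time against the timeout. The timestamps and the expired_members helper are made up for illustration; the real coordinator logic lives inside the broker.

```python
SESSION_TIMEOUT_S = 10.0  # value used in this walkthrough; configurable per consumer

def expired_members(last_heartbeat, now, session_timeout=SESSION_TIMEOUT_S):
    """Return members whose most recent heartbeat is older than the session timeout."""
    return sorted(m for m, seen in last_heartbeat.items() if now - seen > session_timeout)

# Hypothetical timestamps in seconds: C2 sent its last heartbeat at t=3, then crashed.
last_heartbeat = {"C1": 12.0, "C2": 3.0, "C3": 11.5}

print(expired_members(last_heartbeat, now=12.5))  # → [] (C2 silent for 9.5s)
print(expired_members(last_heartbeat, now=14.0))  # → ['C2'] (C2 silent for 11s)
```

Until the timeout elapses, P2 and P3 still belong to C2 on paper, which is why lag grows before any rebalance starts.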

Group Coordinator Triggers Rebalance

The coordinator initiates a rebalance. It sends a signal to all consumers: "Stop processing, we're redistributing partitions."

During an eager rebalance, the entire consumer group pauses. This is the "stop-the-world" moment. Cooperative rebalancing pauses only the partitions that change owner.

  • Rebalance affects ALL consumers
  • Processing halts during rebalance
  • Critical for correctness, painful for latency
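
To make the cost of the pause concrete, the sketch below compares which partitions each survivor must give up under an eager rebalance versus a cooperative one. It is a simplified model under stated assumptions: the two function names are hypothetical, and the real protocols involve several round trips with the coordinator.

```python
def revoked_eager(current, target):
    """Eager: every member gives up everything it owns, whatever the target is."""
    return {member: set(owned) for member, owned in current.items()}

def revoked_cooperative(current, target):
    """Cooperative: a member gives up only the partitions it will not keep."""
    return {m: set(owned) - set(target.get(m, ())) for m, owned in current.items()}

# Survivors' ownership before and after C2's partitions are handed out.
current = {"C1": {"P0", "P1"}, "C3": {"P4", "P5"}}
target = {"C1": {"P0", "P1", "P2"}, "C3": {"P3", "P4", "P5"}}

for name, fn in (("eager", revoked_eager), ("cooperative", revoked_cooperative)):
    print(name, {m: sorted(r) for m, r in fn(current, target).items()})
# → eager {'C1': ['P0', 'P1'], 'C3': ['P4', 'P5']}
# → cooperative {'C1': [], 'C3': []}
```

In this scenario the cooperative model revokes nothing from the survivors, so P0, P1, P4, and P5 keep flowing while P2 and P3 are handed out.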

Partitions Redistributed

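The partition assignment strategy runs on a designated consumer (the group leader) in the classic protocol, with the coordinator distributing the result, or on the coordinator itself in the newer consumer group protocol. The orphaned partitions P2 and P3 are handed to the two surviving consumers.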

C1 now handles P0, P1, P2 and C3 handles P3, P4, P5. Load is still balanced across the two survivors — 3 partitions each, so each does more work until the group scales back out.

  • Assignment strategies: Range, RoundRobin, Sticky
  • Sticky minimizes partition movement
  • Remaining consumers handle more partitions
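
The idea behind a sticky strategy can be sketched as follows: live members keep what they already own, and only orphaned partitions move. This toy version, with the hypothetical name sticky_reassign, handles only the member-loss case and is not Kafka's StickyAssignor implementation.

```python
def sticky_reassign(current, live_members):
    """Keep partitions with live owners; hand orphaned ones to the least-loaded member."""
    assignment = {m: list(current.get(m, [])) for m in sorted(live_members)}
    orphans = sorted(
        p for m, owned in current.items() if m not in assignment for p in owned
    )
    for partition in orphans:
        # Ties on load are broken by member name so the result is deterministic.
        least_loaded = min(assignment, key=lambda m: (len(assignment[m]), m))
        assignment[least_loaded].append(partition)
    return {m: sorted(owned) for m, owned in assignment.items()}

before = {"C1": ["P0", "P1"], "C2": ["P2", "P3"], "C3": ["P4", "P5"]}
after = sticky_reassign(before, live_members={"C1", "C3"})
print(after)
# → {'C1': ['P0', 'P1', 'P2'], 'C3': ['P3', 'P4', 'P5']}
```

Only P2 and P3 change owner; the four partitions that already had a live consumer stay put.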

Processing Resumes

Consumers receive their new assignments and resume processing. Each consumer seeks to the last committed offset for newly assigned partitions.

The accumulated lag is processed, and the group catches up.

  • Committed offsets enable handoff; records processed but not yet committed are redelivered
  • Lag from pause gets processed
  • Group returns to steady state
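
Resuming from the committed offset is also where duplicates come from. The sketch below uses made-up offsets and hypothetical helper names to show which records a new owner sees again; it follows Kafka's convention that the committed offset is the next record to read.

```python
def resume_position(committed_offset):
    """A new owner starts reading at the last committed offset for the partition."""
    return committed_offset

def redelivered(committed_offset, next_unprocessed_offset):
    """Offsets the previous owner processed but never committed."""
    return list(range(committed_offset, next_unprocessed_offset))

# Hypothetical numbers: C2 committed offset 100 on P2, processed through 104, then crashed.
committed, next_unprocessed = 100, 105
print(resume_position(committed))                # → 100
print(redelivered(committed, next_unprocessed))  # → [100, 101, 102, 103, 104]

# If the previous owner had committed everything it processed before giving up
# the partition (possible on a graceful revoke, not on a crash), nothing repeats.
print(redelivered(105, 105))                     # → []
```

This is why processing should be idempotent, and why committing during revocation shortens the duplicate window for planned handoffs.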

Scaling Up: C4 Joins

When traffic increases, we add a new consumer, C4. The group now has three members (the two survivors C1 and C3, plus C4), and the new join triggers another rebalance.

The 6 partitions redistribute evenly again: C1 keeps P0 and P1 and releases P2; C3 keeps P3, picks up P2, and hands P4 and P5 to C4. Every consumer is back to 2 partitions. More consumers means more parallelism, up to the partition count (6).

  • Scale by adding consumers
  • Maximum useful consumers = partition count
  • Each join or leave triggers a rebalance
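
The partition-count ceiling is simple arithmetic. The sketch below, using the hypothetical helper active_and_idle, counts how many consumers can own at least one partition and how many sit idle.

```python
def active_and_idle(num_partitions, num_consumers):
    """Count consumers that can own at least one partition, and those left idle."""
    active = min(num_partitions, num_consumers)
    return active, num_consumers - active

for consumers in (3, 6, 8):
    active, idle = active_and_idle(6, consumers)
    print(f"{consumers} consumers: {active} active, {idle} idle")
# → 3 consumers: 3 active, 0 idle
# → 6 consumers: 6 active, 0 idle
# → 8 consumers: 6 active, 2 idle
```

Adding consumers beyond the partition count only adds rebalances, not throughput; raising the ceiling means adding partitions to the topic.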