Rebalance and Failures

Kerry Wei Tue, 19 Jul 2016 15:29:37 -0700

Hi all,
I'm a bit confused about rebalance and failures:

(If I understand the rebalance procedure correctly)
Suppose that in the middle of a rebalance some consumer, C1, hits an
unclean shutdown (i.e. it crashes or gets a kill -9). The coordinator won't be
aware that C1 is dead until {zookeeper.session.timeout.ms} has passed, so the
rebalance will fail because the partitions of this dead consumer can't be
released and distributed to the other consumers.
Once it realizes C1 is dead, the coordinator excludes it from the rebalance
loop and starts a second retry. However, another consumer, C2, hits an unclean
shutdown during the second rebalance, causing the rebalance to fail
again... and if the coordinator exhausts all retries (with
{rebalance.backoff.ms} between each retry), then the rebalance will
not complete.


My question is: what are the consequences of a rebalance that eventually
fails? For example, will some partitions held by the dead consumers not be
consumed?

If new consumers join the group during a rebalance and existing
consumers crash or are killed with kill -9, does that mean the rebalance could
continue forever? If so, what would be a good point to stop retrying? For
example, choose
{rebalance.max.retries} * {rebalance.backoff.ms} > N * {zookeeper.session.timeout.ms},
where N is the number of consumer crashes you want to survive during a
rebalance.
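
To make that inequality concrete, here is a minimal Python sketch that computes the smallest {rebalance.max.retries} satisfying it. The function name min_rebalance_retries and the sample values are assumptions chosen for illustration, and the calculation only reflects my reading of the retry behaviour above, not a confirmed description of how the coordinator works:

```python
def min_rebalance_retries(n_crashes, session_timeout_ms, backoff_ms):
    """Smallest integer R such that R * backoff_ms > n_crashes * session_timeout_ms.

    n_crashes          -- N, the number of consumer crashes to survive
    session_timeout_ms -- value of zookeeper.session.timeout.ms
    backoff_ms         -- value of rebalance.backoff.ms
    """
    if n_crashes < 0 or session_timeout_ms <= 0 or backoff_ms <= 0:
        raise ValueError("n_crashes must be >= 0 and both timeouts must be > 0")
    # Strict inequality: floor division plus one gives the first R that
    # pushes the total retry window past N session timeouts.
    return (n_crashes * session_timeout_ms) // backoff_ms + 1


if __name__ == "__main__":
    # Hypothetical settings: 6 s session timeout, 2 s backoff, survive 2 crashes.
    retries = min_rebalance_retries(2, 6000, 2000)
    print("rebalance.max.retries >=", retries)
```

With these sample values the retry window has to exceed 12000 ms, so at a 2000 ms backoff at least 7 retries would be needed.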



BTW, how do you search the Kafka email archives?

Thanks!
Kerry
