Hi all,

I'm a bit confused about rebalance and failures (assuming I understand the rebalance procedure correctly).

Suppose that in the middle of a rebalance, some consumer, C1, hits an unclean shutdown (i.e. it crashes, or gets kill -9). The coordinator won't be aware that C1 is dead until {zookeeper.session.timeout.ms} has passed, so the rebalance will fail because the partitions of the dead consumer can't be released and distributed to other consumers. Once it realizes C1 is dead, the coordinator excludes it from the rebalance loop and starts a second attempt. However, another consumer, C2, hits an unclean shutdown during the second rebalance, causing the rebalance to fail again... and if the coordinator exhausts all retries (with {rebalance.backoff.ms} between each retry), then the rebalance will not complete.
My question is: what are the consequences of a rebalance that eventually fails? E.g., will some partitions held by the dead consumers not be consumed?

If new consumers keep joining the group during a rebalance while existing consumers crash or get kill -9'd, does that mean the rebalance could continue forever? If so, what would be a good time to stop retrying? E.g., let {rebalance.max.retries} * {rebalance.backoff.ms} > N * {zookeeper.session.timeout.ms}, where N controls how many consumer crashes during a rebalance you want to survive.

BTW, how do you search the Kafka email archives?

Thanks!
Kerry
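
P.S. To make the sizing rule above concrete, here is a minimal Python sketch. It is only arithmetic on the inequality as I stated it, not anything taken from Kafka's code; the function names and the sample values (4 retries, 2000 ms backoff, 6000 ms session timeout) are made up for illustration, and whether the inequality is the right stopping rule is exactly what I'm asking.

```python
def survivable_crashes(max_retries: int, backoff_ms: int, session_timeout_ms: int) -> int:
    """Largest N such that max_retries * backoff_ms > N * session_timeout_ms."""
    retry_window_ms = max_retries * backoff_ms
    # The inequality is strict, so a window that is an exact multiple of the
    # session timeout does not count as surviving that last crash.
    return max(0, (retry_window_ms - 1) // session_timeout_ms)


def min_retries_for(n_crashes: int, backoff_ms: int, session_timeout_ms: int) -> int:
    """Smallest retry count R such that R * backoff_ms > n_crashes * session_timeout_ms."""
    return (n_crashes * session_timeout_ms) // backoff_ms + 1


if __name__ == "__main__":
    # Hypothetical settings, chosen only to exercise the arithmetic.
    print(survivable_crashes(max_retries=4, backoff_ms=2000, session_timeout_ms=6000))
    print(min_retries_for(n_crashes=2, backoff_ms=2000, session_timeout_ms=6000))
```

With these sample numbers the retry window is 8000 ms, which covers one session timeout but not two, so surviving two crashes in a row would need at least 7 retries at the same backoff.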