I'm working on a fault-tolerant consumer group. The idea is to maximize Kafka throughput: I request the topic metadata from a broker, create one consumer per partition for each topic, and distribute those consumers across different nodes. There is also a mechanism that detects the failure of any node and restarts it.

The problem: if I kill one of the consumer processes, my program detects it and relaunches a new consumer with the same group id and client id. But the new consumer hits an error (something like "zookeeper entry doesn't exist"; I didn't keep the log) and never starts. I think the root cause is a race: ZooKeeper detects the failure of the old consumer process, but before it deletes that consumer's registration, the new consumer comes up and starts talking to ZooKeeper; ZooKeeper then deletes the entry, and the new consumer is no longer recognized. The sequence looks like this:

old consumer dies -> ZooKeeper detects it -> new consumer (same group id / client id) comes up -> ZooKeeper deletes the consumer entry -> new consumer hits the error and is not recognized by ZooKeeper
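For reference, this is roughly how each consumer is created and relaunched. It's a minimal sketch assuming the ZooKeeper-based high-level consumer API (kafka.consumer.* / kafka.javaapi.consumer.*); the class name, helper method, and property values are illustrative, not my exact code:

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class PartitionConsumer {
        // One connector per partition; groupId and clientId are reused when
        // the supervising node relaunches a failed consumer.
        public static ConsumerConnector start(String zkConnect, String groupId,
                                              String clientId, String topic) {
            Properties props = new Properties();
            props.put("zookeeper.connect", zkConnect);   // e.g. "zk1:2181" (placeholder)
            props.put("group.id", groupId);              // same group id on restart
            props.put("client.id", clientId);            // same client id on restart
            props.put("auto.commit.interval.ms", "1000");

            ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

            // Ask for a single stream for this topic; with #partitions connectors
            // in the same group, rebalancing assigns roughly one partition each.
            Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(Collections.singletonMap(topic, 1));
            // ... hand streams.get(topic).get(0) to a worker thread ...
            return connector;
        }
    }

The relaunch path simply calls start() again with the same groupId and clientId as the dead process, which is when I see the failure above.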
It's OK that I won't lose any data, since those partitions get picked up by the other consumers, but it's annoying: I want the consumer group to stay balanced after fail-over.

Thanks,
Siyuan