Hi all,

We run a small Kafka cluster (0.7.1, 3 nodes) in EC2. The load is about
200 million events per day, each a few kilobytes. We have a single-node
ZooKeeper.

Yesterday our Kafka clients suddenly started throwing the following
exception:
java.lang.RuntimeException: kafka.common.ConsumerRebalanceFailedException: CONSUMER_GROUP_NAME_ip-00-00-00-00.ec2.internal-1373821190828-5f78e9af can't rebalance after 4 retries
    at com.gumgum.kafka.consumer.KafkaTemplate.executeWithBatch(KafkaTemplate.java:59)
    at com.gumgum.storm.fileupload.GenericKafkaSpout.nextTuple(GenericKafkaSpout.java:73)
    at backtype.storm.daemon.executor$fn__3968$fn__4009$fn__4010.invoke(executor.clj:433)
    at backtype.storm.util$async_loop$fn__465.invoke(util.clj:377)

None of the Kafka clients (ConsumerConnector class) would start; each one
failed with the exception above.

We tried restarting the clients, and restarting ZooKeeper as well, but
everything only started working again once we restarted all of our Kafka
brokers. We didn't lose any data because the producers (which go directly
to the brokers through a load balancer) kept working fine.

I tried googling this issue and it looks like a lot of people have faced it,
but I couldn't find anything concrete.

Given this, I have two questions:

It would be nice if you could tell me why this can happen, or point me to a
link where I can understand it better. What does consumer rebalancing mean?
Does it mean the consumers are trying to coordinate amongst themselves using
ZooKeeper?

On a separate note, are there any JMX metrics I should be monitoring to
make sure that my Kafka cluster is healthy? More generally, how can I keep
watch on the cluster?
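
To make the monitoring question concrete, here is a rough sketch of the kind
of JMX poller I have in mind. It uses only the standard javax.management API.
The class name JmxProbe, the host/port arguments, and the ObjectName pattern
are all hypothetical: I am assuming the brokers are started with remote JMX
enabled, and I do not know which MBean names Kafka 0.7.1 actually registers,
so the default pattern "*:*" simply dumps everything the JVM exposes.

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.TreeMap;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper, not part of Kafka: dumps readable JMX attributes.
class JmxProbe {

    // Reads every readable attribute of every MBean matching the pattern.
    // Keys have the form "<ObjectName>/<attribute>".
    static Map<String, Object> snapshot(MBeanServerConnection conn, String pattern)
            throws Exception {
        Map<String, Object> values = new TreeMap<>();
        for (ObjectName name : conn.queryNames(new ObjectName(pattern), null)) {
            for (MBeanAttributeInfo attr : conn.getMBeanInfo(name).getAttributes()) {
                if (!attr.isReadable()) {
                    continue;
                }
                try {
                    values.put(name + "/" + attr.getName(),
                               conn.getAttribute(name, attr.getName()));
                } catch (Exception e) {
                    // Some attributes throw when read; skip them.
                }
            }
        }
        return values;
    }

    static void print(Map<String, Object> values) {
        for (Map.Entry<String, Object> e : values.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            // No host/port given: inspect this JVM so the sketch can be tried locally.
            print(snapshot(ManagementFactory.getPlatformMBeanServer(),
                           "java.lang:type=Runtime"));
            return;
        }
        // Assumption: the broker JVM exposes remote JMX on args[0]:args[1].
        String pattern = args.length > 2 ? args[2] : "*:*";
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + args[0] + ":" + args[1] + "/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            print(snapshot(connector.getMBeanServerConnection(), pattern));
        }
    }
}
```

Running it against each broker on a schedule would give me raw numbers, but I
still would not know which attributes matter or what thresholds indicate
trouble, which is really what I am asking.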

Regards,
Vaibhav Puranik
GumGum
