rebalance.backoff.ms

Thanks,

Jun

On Mon, Dec 2, 2013 at 11:31 AM, Yu, Libo <libo...@citi.com> wrote:

> Thanks for your insights, Jun. That is really helpful. I forgot to mention
> the cause of the issue in my previous email. We have three brokers. I notice
> from the log that all three brokers re-registered themselves with zk. That
> means all of them were somehow offline for a short time and then
> automatically came online again. That caused the rebalance failure. While
> all the brokers are offline, I assume a consumer will constantly retry to
> establish a connection. How long is the interval between the retries? Is it
> max.fetch.wait + socket.timeout.ms?
>
> Thanks.
>
> Libo
>
> -----Original Message-----
> From: Jun Rao [mailto:jun...@gmail.com]
> Sent: Monday, December 02, 2013 11:55 AM
> To: users@kafka.apache.org
> Subject: Re: ConsumerRebalanceFailedException
>
> Is the failure on the last rebalance? If so, some partitions will not have
> any consumers. A common reason for rebalance failure is a conflict in
> owning partitions among different consumers in the same group. Increasing
> the number of retries and the amount of backoff time between retries should
> help. Our default setting should be good enough if there are not too many
> topics being subscribed and the ZK latency is normal.
>
> Thanks,
>
> Jun
>
> On Mon, Dec 2, 2013 at 6:57 AM, Yu, Libo <libo...@citi.com> wrote:
>
> > Actually, I saw this line in the log: can't rebalance after 4 retries.
> > What should I expect in this case? Did all consumer threads fail, or
> > only some of them? If I increase the number of retries or the delay
> > between retries, will that help?
> >
> > Regards,
> >
> > Libo
> >
> > -----Original Message-----
> > From: Jun Rao [mailto:jun...@gmail.com]
> > Sent: Friday, November 29, 2013 8:50 PM
> > To: users@kafka.apache.org
> > Subject: Re: ConsumerRebalanceFailedException
> >
> > Transient rebalance failures are ok. However, it's important that the
> > last rebalance in a sequence succeeds. Otherwise, some of the partitions
> > will not be consumed by any consumers.
> >
> > Thanks,
> >
> > Jun
> >
> > On Fri, Nov 29, 2013 at 10:44 AM, Yu, Libo <libo...@citi.com> wrote:
> >
> > > You are right, Joe. I checked our brokers' log. We have three brokers.
> > > All of them failed to connect to zk at some point. So they were
> > > offline and later re-registered themselves with zk. I don't know how
> > > many rebalances should be triggered in that case. There is only one
> > > exception found in the consumer's log. My question is whether users
> > > need to do anything to handle ConsumerRebalanceFailedException.
> > >
> > > This is from the consumer log:
> > >
> > > [28/11/13 16:38:56:056 PM EST] 102 ERROR consumer.ZookeeperConsumerConnector: [xxxxxxxxxx ], error during syncedRebalance
> > > kafka.common.ConsumerRebalanceFailedException: xxxxxxxxx can't rebalance after 4 retries
> > >     at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:397)
> > >     at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener$$anon$1.run(ZookeeperConsumerConnector.scala:326)
> > >
> > > Regards,
> > >
> > > Libo
> > >
> > > -----Original Message-----
> > > From: Joe Stein [mailto:joe.st...@stealth.ly]
> > > Sent: Friday, November 29, 2013 11:57 AM
> > > To: users@kafka.apache.org
> > > Subject: Re: ConsumerRebalanceFailedException
> > >
> > > What is the full stack trace? If you see "can't rebalance after 4
> > > retries" then likely the problem is that the broker is down or not
> > > available.
> > >
> > > /*******************************************
> > >  Joe Stein
> > >  Founder, Principal Consultant
> > >  Big Data Open Source Security LLC
> > >  http://www.stealth.ly
> > >  Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
> > > ********************************************/
> > >
> > > On Fri, Nov 29, 2013 at 11:31 AM, Yu, Libo <libo...@citi.com> wrote:
> > >
> > > > We found our consumer stopped working after this exception occurred.
> > > > Can the consumer recover from such an exception?
> > > >
> > > > Regards,
> > > >
> > > > Libo
> > > >
> > > > -----Original Message-----
> > > > From: Florin Trofin [mailto:ftro...@adobe.com]
> > > > Sent: Tuesday, July 16, 2013 4:20 PM
> > > > To: users@kafka.apache.org
> > > > Subject: Re: ConsumerRebalanceFailedException
> > > >
> > > > Yes, I think these are two separate issues.
> > > >
> > > > F.
> > > >
> > > > On 7/16/13 11:32 AM, "Joel Koshy" <jjkosh...@gmail.com> wrote:
> > > >
> > > > >From a user's perspective, ConsumerRebalanceException is a bit
> > > > >cryptic - I think the other thread was to provide a more
> > > > >informative message and also be able to recover when a broker
> > > > >does come up (fixed in KAFKA-969).
> > > > >
> > > > >Thanks,
> > > > >
> > > > >Joel
> > > > >
> > > > >On Tue, Jul 16, 2013 at 11:04 AM, Vaibhav Puranik
> > > > ><vpura...@gmail.com> wrote:
> > > > >> Thank you Joel.
> > > > >>
> > > > >> In a different but related thread, somebody is asking to rename
> > > > >> the exception to NoBrokerAvailableException. But given the
> > > > >> description above, the exception seems to be named appropriately.
> > > > >>
> > > > >> Regards,
> > > > >> Vaibhav
> > > > >>
> > > > >> On Tue, Jul 16, 2013 at 12:05 AM, Joel Koshy
> > > > >> <jjkosh...@gmail.com> wrote:
> > > > >>
> > > > >>> Yes - rebalance => consumers trying to coordinate through ZK.
> > > > >>> Rebalances can happen when one or more of the following happen:
> > > > >>> - a consumed topic partition appears or disappears - i.e., if
> > > > >>>   a broker comes or goes.
> > > > >>> - a consumer instance in the group comes or goes. "Goes" could
> > > > >>>   also be triggered by session expirations in zookeeper -
> > > > >>>   typically caused by client-side GC or flaky connections to
> > > > >>>   zookeeper.
> > > > >>>
> > > > >>> On Mon, Jul 15, 2013 at 10:15 AM, Vaibhav Puranik
> > > > >>> <vpura...@gmail.com> wrote:
> > > > >>> > Hi all,
> > > > >>> >
> > > > >>> > We have a small Kafka cluster (0.7.1 - 3 nodes) in EC2. The
> > > > >>> > load is about 200 million events per day, each being a few
> > > > >>> > kilobytes. We have a single-node zookeeper.
> > > > >>> >
> > > > >>> > Yesterday our Kafka clients suddenly started throwing the
> > > > >>> > following exception:
> > > > >>> >
> > > > >>> > java.lang.RuntimeException: kafka.common.ConsumerRebalanceFailedException: CONSUMER_GROUP_NAME_ip-00-00-00-00.ec2.internal-1373821190828-5f78e9af can't rebalance after 4 retries
> > > > >>> >     at com.gumgum.kafka.consumer.KafkaTemplate.executeWithBatch(KafkaTemplate.java:59)
> > > > >>> >     at com.gumgum.storm.fileupload.GenericKafkaSpout.nextTuple(GenericKafkaSpout.java:73)
> > > > >>> >     at backtype.storm.daemon.executor$fn__3968$fn__4009$fn__4010.invoke(executor.clj:433)
> > > > >>> >     at backtype.storm.util$async_loop$fn__465.invoke(util.clj:377)
> > > > >>> >
> > > > >>> > None of the Kafka clients (ConsumerConnector class) would
> > > > >>> > start. They would fail with the exception.
> > > > >>> >
> > > > >>> > We tried restarting the clients, and restarting the zookeeper
> > > > >>> > as well. But it all finally started working when we restarted
> > > > >>> > all of our kafka brokers. We didn't lose any data because the
> > > > >>> > producers (going directly to the brokers through a load
> > > > >>> > balancer) were working fine.
> > > > >>> >
> > > > >>> > I tried googling this issue and it looks like a lot of people
> > > > >>> > have faced it, but I couldn't find anything concrete.
> > > > >>> >
> > > > >>> > Given this, I have two questions:
> > > > >>> >
> > > > >>> > It would be nice if you could tell me why this can happen or
> > > > >>> > point me to a link where I can understand it better. What does
> > > > >>> > Consumer Rebalancing mean? Does that mean consumers are trying
> > > > >>> > to coordinate amongst themselves using Zookeeper?
> > > > >>> >
> > > > >>> > On a separate note, are there any JMX parameters I need to be
> > > > >>> > monitoring to make sure that my kafka cluster is healthy? How
> > > > >>> > can I keep watch on my kafka cluster?
> > > > >>> >
> > > > >>> > Regards,
> > > > >>> > Vaibhav Puranik
> > > > >>> > GumGum
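For readers who want to try the suggestion in this thread (more rebalance retries and a longer backoff between them), here is a minimal sketch of the consumer properties involved. It is an illustration, not code from any poster. Assumptions: the property names are the 0.8 high-level consumer names as I recall them (rebalance.max.retries, rebalance.backoff.ms, zookeeper.session.timeout.ms); the 0.7.x consumer in the original July post uses different names. The class RebalanceConfigSketch, the host "zk1:2181", the group "my-group", and the numeric values are all hypothetical. The check that retries times backoff should exceed the ZooKeeper session timeout is a rule of thumb, not something stated in the thread.

```java
import java.util.Properties;

// Hypothetical helper, not part of Kafka. It only assembles plain properties
// that could be handed to a ZooKeeper-based consumer; no Kafka classes are used.
class RebalanceConfigSketch {

    // Property names assume the 0.8 high-level consumer; verify against your version.
    static Properties build(String zkConnect, String groupId,
                            int maxRetries, int backoffMs, int zkSessionTimeoutMs) {
        Properties props = new Properties();
        props.setProperty("zookeeper.connect", zkConnect);
        props.setProperty("group.id", groupId);
        props.setProperty("zookeeper.session.timeout.ms", Integer.toString(zkSessionTimeoutMs));
        props.setProperty("rebalance.max.retries", Integer.toString(maxRetries));
        props.setProperty("rebalance.backoff.ms", Integer.toString(backoffMs));
        return props;
    }

    // Rough upper bound on how long the consumer keeps retrying a rebalance.
    static long retryWindowMs(Properties props) {
        long retries = Long.parseLong(props.getProperty("rebalance.max.retries"));
        long backoff = Long.parseLong(props.getProperty("rebalance.backoff.ms"));
        return retries * backoff;
    }

    // Rule of thumb (an assumption): the retry window should outlast the
    // ZooKeeper session timeout so stale ownership entries can expire first.
    static boolean outlastsSessionTimeout(Properties props) {
        long sessionTimeout = Long.parseLong(props.getProperty("zookeeper.session.timeout.ms"));
        return retryWindowMs(props) > sessionTimeout;
    }

    public static void main(String[] args) {
        Properties props = build("zk1:2181", "my-group", 10, 5000, 6000);
        System.out.println("retry window ms: " + retryWindowMs(props));
        System.out.println("outlasts ZK session timeout: " + outlastsSessionTimeout(props));
    }
}
```

With these illustrative values the consumer would keep retrying for roughly 50 seconds instead of giving up after 4 attempts, which may be enough to ride out a short broker or ZooKeeper outage like the one described above.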