>From the logs it seems the consumer's ZK registry log has lost, while KAFKA-693 is mainly due to server side issue. Could you check if there is a session timeout from the consumer on ZK log.
Guozhang On Sat, Jan 11, 2014 at 2:33 PM, Seshadri, Balaji <balaji.sesha...@dish.com>wrote: > We found the below bug in Kafka JIRA. > > https://issues.apache.org/jira/browse/KAFKA-693 > > Could you guys please let us know if this bug is causing the fetchers > getting stooped because of rebalance failure,we are using 0.8-beta and this > fix seems to be in 0.8 version. > > The issue happens only when we run it in tomcat,when we use JUnits with > same group.id to test we receive the messages.It would be great if you > could help us confirm if it is a bug. > > Also we would like to know if we need to upgrade the server/client to 0.8 > or just server side code. > > > ________________________________________ > From: Rob Withers [robert.w.with...@gmail.com] > Sent: Friday, January 10, 2014 10:17 PM > To: users@kafka.apache.org > Subject: Re: Looks like consumer fetchers get stopped we are not getting > any data > > That was an interesting section too. Which GC settings would you suggest? > > Thank you, > - charlie > > > On Jan 10, 2014, at 10:11 PM, Jun Rao <jun...@gmail.com> wrote: > > > > Have you looked at our FAQ, especially > > > https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whyaretheremanyrebalancesinmyconsumerlog > > ? > > > > Thanks, > > > > Jun > > > > > > On Fri, Jan 10, 2014 at 2:25 PM, Seshadri, Balaji > > <balaji.sesha...@dish.com>wrote: > > > >> Any clue would be helpful. > >> > >> -----Original Message----- > >> From: Seshadri, Balaji [mailto:balaji.sesha...@dish.com] > >> Sent: Friday, January 10, 2014 12:46 PM > >> To: users@kafka.apache.org > >> Subject: RE: Looks like consumer fetchers get stopped we are not getting > >> any data > >> > >> Yes rebalance begins and exceptions occurs. > >> > >> > >> {2014-01-10 00:58:11,293} INFO > >> > [account-info-updated-hadoop-consumer_tm1mwdpl04-1389222553159-ad59660b_watcher_executor] > >> (?:?) - [account-i > >> nfo-updated-hadoop-consumer_tm1mwdpl04-1389222553159-ad59660b], Cleared > the > >> data chunks in all the consumer message iterators > >> {2014-01-10 00:58:11,293} INFO > >> > [account-info-updated-hadoop-consumer_tm1mwdpl04-1389222553159-ad59660b_watcher_executor] > >> (?:?) - [account-i > >> nfo-updated-hadoop-consumer_tm1mwdpl04-1389222553159-ad59660b], > Committing > >> all offsets after clearing the fetcher queues > >> {2014-01-10 00:58:11,298} DEBUG [catalina-exec-12-SendThread( > >> tvip-m1-mw-zookeeper.dish.com:2181)] (ClientCnxn.java:759) - Got ping > >> response for sessionid: 0x1437b2879870005 af ter 0ms > >> {2014-01-10 00:58:11,313} INFO > >> > [account-activated-hadoop-consumer_tm1mwdpl04-1389222557906-562b6738_watcher_executor] > >> (?:?) - > >> [account-activated-hadoop-consumer_tm1mwdpl04-1389222557906-562b6738], > >> begin rebalancing consumer > >> account-activated-hadoop-consumer_tm1mwdpl04-1389222557906-562b6738 try > #1 > >> {2014-01-10 00:58:11,314} DEBUG [catalina-exec-12-SendThread( > >> tvip-m1-mw-zookeeper.dish.com:2181)] (ClientCnxn.java:839) - Reading > >> reply sessionid:0x1437b2879870005, packet:: clientPath:null > serverPath:null > >> finished:false header:: 627,8 replyHeader:: 627,51539619966,0 > request:: > >> '/brokers/ids,F response:: v{'1} > >> {2014-01-10 00:58:11,315} DEBUG [catalina-exec-12-SendThread( > >> tvip-m1-mw-zookeeper.dish.com:2181)] (ClientCnxn.java:839) - Reading > >> reply sessionid:0x1437b2879870005, packet:: clientPath:null > serverPath:null > >> finished:false header:: 628,4 replyHeader:: 628,51539619966,0 > request:: > >> '/brokers/ids/1,F response:: > >> > #7b2022686f7374223a22746d312d6b61666b6162726f6b6572313031222c20226a6d785f706f7274223a393939392c2022706f7274223a393039322c202276657273696f6e223a31207d,s{47244644685,47244644685,1388537628753,1388537628753,0,0,0,163056791896588316,74,0,47244644685} > >> {2014-01-10 00:58:11,316} DEBUG [catalina-exec-12-SendThread( > >> tvip-m1-mw-zookeeper.dish.com:2181)] (ClientCnxn.java:839) - Reading > >> reply sessionid:0x1437b2879870005, packet:: clientPath:null > serverPath:null > >> finished:false header:: 629,4 replyHeader:: 629,51539619966,-101 > >> request:: > >> > '/consumers/account-activated-hadoop-consumer/ids/account-activated-hadoop-consumer_tm1mwdpl04-1389222557906-562b6738,F > >> response:: > >> {2014-01-10 00:58:11,316} INFO > >> > [account-activated-hadoop-consumer_tm1mwdpl04-1389222557906-562b6738_watcher_executor] > >> (?:?) - > >> [account-activated-hadoop-consumer_tm1mwdpl04-1389222557906-562b6738], > >> exception during rebalance > >> org.I0Itec.zkclient.exception.ZkNoNodeException: > >> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = > >> NoNode for > >> > /consumers/account-activated-hadoop-consumer/ids/account-activated-hadoop-consumer_tm1mwdpl04-1389222557906-562b6738 > >> at > >> org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:47) > >> at > >> org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:685) > >> at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:766) > >> at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:761) > >> at kafka.utils.ZkUtils$.readData(Unknown Source) > >> at kafka.consumer.TopicCount$.constructTopicCount(Unknown Source) > >> at > >> > kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.kafka$consumer$ZookeeperConsumerConnector$ZKRebalancerListener$$rebalance(Unknown > >> Source) > >> at > >> > kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener$$anonfun$syncedRebalance$1.apply$mcVI$sp(Unknown > >> Source) > >> at > scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:142) > >> at > >> > kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(Unknown > >> Source) > >> at > >> > kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener$$anon$1.run(Unknown > >> Source) Caused by: org.apache.zookeeper.KeeperException$NoNodeException: > >> KeeperErrorCode = NoNode for > >> > /consumers/account-activated-hadoop-consumer/ids/account-activated-hadoop-consumer_tm1mwdpl04-1389222557906-562b6738 > >> at > >> org.apache.zookeeper.KeeperException.create(KeeperException.java:102) > >> at > >> org.apache.zookeeper.KeeperException.create(KeeperException.java:42) > >> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927) > >> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:956) > >> at > org.I0Itec.zkclient.ZkConnection.readData(ZkConnection.java:103) > >> : > >> -----Original Message----- > >> From: Guozhang Wang [mailto:wangg...@gmail.com] > >> Sent: Friday, January 10, 2014 12:30 PM > >> To: users@kafka.apache.org > >> Subject: Re: Looks like consumer fetchers get stopped we are not getting > >> any data > >> > >> Actually, the broken channel is broken by shutting down the > >> leader-finder-thread, which is shutdown either by a rebalance retry or > >> shutting down the consumer. > >> > >> Do you see "begin rebalance ..." before this log entry? And if yes, > search > >> to see if the rebalance keep failing. > >> > >> Guozhang > >> > >> > >> On Fri, Jan 10, 2014 at 11:23 AM, Guozhang Wang <wangg...@gmail.com> > >> wrote: > >> > >>> From your logs the channel with the brokers are broken, are the > >>> brokers alive at that time? > >>> > >>> Guozhang > >>> > >>> > >>> On Fri, Jan 10, 2014 at 10:52 AM, Withers, Robert > >>> <robert.with...@dish.com > >>>> wrote: > >>> > >>>> The core problem is our consumers stop consuming and lag increases. > >>>> We found this blog: > >> > https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Myconsumerseemstohavestopped,why > >> ?. > >>>> This lists 3 possibilities. > >>>> > >>>> The blog also talks earlier about spurious rebalances, due to > >>>> improper GC settings, but we couldn't find what GC settings to use. > >>>> We are considering changing the zookeeper timeouts. We are a little > >>>> confused about the various issues, the sequence of issues and what > >>>> could cause the consumers to stop reading. If the fetchers get > >>>> shutdown, due to a ClosedByInterruptException in the "leader_finder" > >>>> thread, which tells the "executor_watcher" thread to shutdown the > >>>> fetchers, that would be another reason the consumers stop processing > >> data. Is this possible? > >>>> > >>>> Thank you, > >>>> rob > >>>> > >>>> -----Original Message----- > >>>> From: Seshadri, Balaji [mailto:balaji.sesha...@dish.com] > >>>> Sent: Friday, January 10, 2014 11:40 AM > >>>> To: users@kafka.apache.org > >>>> Subject: RE: Looks like consumer fetchers get stopped we are not > >>>> getting any data > >>>> > >>>> It would be helpful if you guys can shed some light why all fetchers > >>>> are getting stopped. > >>>> > >>>> -----Original Message----- > >>>> From: Seshadri, Balaji [mailto:balaji.sesha...@dish.com] > >>>> Sent: Friday, January 10, 2014 11:28 AM > >>>> To: users@kafka.apache.org > >>>> Subject: RE: Looks like consumer fetchers get stopped we are not > >>>> getting any data > >>>> > >>>> We also got the below error when this happens. > >>>> > >>>> {2014-01-10 00:58:11,292} INFO > >>>> > >>>> [account-info-updated-hadoop-consumer_tm1mwdpl04-1389222553159-ad5966 > >>>> 0b_watcher_executor] > >>>> (?:?) - > >>>> [account-info-updated-hadoop-consumer_tm1mwdpl04-1389222553159-ad5966 > >>>> 0b], > >>>> exception during rebalance > >>>> org.I0Itec.zkclient.exception.ZkNoNodeException: > >>>> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode > >>>> = NoNode for > >> > /consumers/account-info-updated-hadoop-consumer/ids/account-info-updated-hadoop-consumer_tm1mwdpl04-1389222553159-ad59660b > >>>> at > >>>> org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:47) > >>>> at > >>>> org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:685) > >>>> at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:766) > >>>> at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:761) > >>>> at kafka.utils.ZkUtils$.readData(Unknown Source) > >>>> at kafka.consumer.TopicCount$.constructTopicCount(Unknown > >> Source) > >>>> at > >>>> kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.kafka$ > >>>> consumer$ZookeeperConsumerConnector$ZKRebalancerListener$$rebalance(U > >>>> nknown > >>>> Source) > >>>> at > >>>> kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener$$anonf > >>>> un$syncedRebalance$1.apply$mcVI$sp(Unknown > >>>> Source) > >>>> at > >>>> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:142) > >>>> at > >>>> kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.synced > >>>> Rebalance(Unknown > >>>> Source) > >>>> at > >>>> kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener$$anon$ > >>>> 1.run(Unknown > >>>> Source) Caused by: > org.apache.zookeeper.KeeperException$NoNodeException: > >>>> KeeperErrorCode = NoNode for > >> > /consumers/account-info-updated-hadoop-consumer/ids/account-info-updated-hadoop-consumer_tm1mwdpl04-1389222553159-ad59660b > >>>> at > >>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:102) > >>>> at > >>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:42) > >>>> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927) > >>>> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:956) > >>>> at > >>>> org.I0Itec.zkclient.ZkConnection.readData(ZkConnection.java:103) > >>>> at org.I0Itec.zkclient.ZkClient$9.call(ZkClient.java:770) > >>>> at org.I0Itec.zkclient.ZkClient$9.call(ZkClient.java:766) > >>>> at > >>>> org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675) > >>>> ... 9 more > >>>> > >>>> -----Original Message----- > >>>> From: Seshadri, Balaji [mailto:balaji.sesha...@dish.com] > >>>> Sent: Friday, January 10, 2014 10:52 AM > >>>> To: users@kafka.apache.org > >>>> Subject: Looks like consumer fetchers get stopped we are not getting > >>>> any data > >>>> > >>>> Please let us know why we are not getting any data from Kafaka after > >>>> this log from Kafka,can you guys lets us know. > >>>> > >>>> What could be causing all fetchers associated to be stooped why it is > >>>> not doing retry. > >>>> > >>>> {2014-01-10 00:58:09,284} WARN > >>>> > >>>> [account-info-updated-hadoop-consumer_tm1mwdpl04-1389222553159-ad5966 > >>>> 0b-leader-finder-thread] > >>>> (?:?) - Fetching topic metadata with correlation id 3 for topics > >>>> [Set(account-info-updated)] from broker > >>>> [id:1,host:tm1-kafkabroker101,port:9092] failed > >>>> java.nio.channels.ClosedByInterruptException > >>>> at > >> > java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202) > >>>> at > >> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:506) > >>>> at > java.nio.channels.SocketChannel.write(SocketChannel.java:493) > >>>> at kafka.network.BoundedByteBufferSend.writeTo(Unknown Source) > >>>> at kafka.network.Send$class.writeCompletely(Unknown Source) > >>>> at > >>>> kafka.network.BoundedByteBufferSend.writeCompletely(Unknown > >>>> Source) > >>>> at kafka.network.BlockingChannel.send(Unknown Source) > >>>> at kafka.producer.SyncProducer.liftedTree1$1(Unknown Source) > >>>> at > >>>> kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(Unkno > >>>> wn > >>>> Source) > >>>> at kafka.producer.SyncProducer.send(Unknown Source) > >>>> at kafka.client.ClientUtils$.fetchTopicMetadata(Unknown Source) > >>>> at kafka.client.ClientUtils$.fetchTopicMetadata(Unknown Source) > >>>> at > >>>> kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(Unkno > >>>> wn > >>>> Source) > >>>> at kafka.utils.ShutdownableThread.run(Unknown Source) > >>>> {2014-01-10 00:58:09,284} DEBUG > >>>> [account-access-hadoop-consumer_tm1mwdpl04-1389222551916-a0c87abc_wat > >>>> cher_executor] > >>>> (?:?) - initial fetch offset of account-access:27: fetched offset = > >> 9655: > >>>> consumed offset = 9655 is 9655 > >>>> {2014-01-10 00:58:09,284} DEBUG > >>>> [bill-generated-hadoop-consumer_tm1mwdpl04-1389222547995-29a6dce9_wat > >>>> cher_executor] > >>>> (?:?) - initial consumer offset of bill-generated:11: fetched offset = > >> 152: > >>>> consumed offset = 152 is 152 > >>>> {2014-01-10 00:58:09,284} DEBUG > >>>> [outbound-communications-hadoop-consumer_tm1mwdpl04-1389222550693-8bc > >>>> 34b77_watcher_executor] > >>>> (?:?) - > >>>> [outbound-communications-hadoop-consumer_tm1mwdpl04-1389222550693-8bc > >>>> 34b77], > >>>> outbound-communications:108: fetched offset = 1689: consumed offset = > >>>> 1689 selected new offset 1689 > >>>> {2014-01-10 00:58:09,284} DEBUG [catalina-exec-3-SendThread( > >>>> tvip-m1-mw-zookeeper.dish.com:2181)] (ClientCnxn.java:839) - Reading > >>>> reply sessionid:0x1434b49cf56383b, packet:: clientPath:null > >>>> serverPath:null finished:false header:: 279,4 replyHeader:: > >> 279,51539617506,0 request:: > >>>> '/consumers/outbound-call-attempted-hadoop-consumer/offsets/outbound- > >>>> call-attempted/14,F > >>>> response:: > >>>> #30,s{39860186414,39860186414,1387517714994,1387517714994,0,0,0,0,1,0 > >>>> ,39860186414} > >>>> {2014-01-10 00:58:09,285} INFO > >>>> > >>>> [outbound-communications-hadoop-consumer_tm1mwdpl04-1389222550693-8bc > >>>> 34b77_watcher_executor] > >>>> (?:?) - > >>>> [outbound-communications-hadoop-consumer_tm1mwdpl04-1389222550693-8bc > >>>> 34b77], > >>>> outbound-communications-hadoop-consumer_tm1mwdpl04-1389222550693-8bc3 > >>>> 4b77-3 > >>>> attempting to claim partition 109 > >>>> {2014-01-10 00:58:09,284} DEBUG > >>>> [bill-generated-hadoop-consumer_tm1mwdpl04-1389222547995-29a6dce9_wat > >>>> cher_executor] > >>>> (?:?) - initial fetch offset of bill-generated:11: fetched offset = > 152: > >>>> consumed offset = 152 is 152 > >>>> {2014-01-10 00:58:09,284} DEBUG [catalina-exec-12-SendThread( > >>>> tvip-m1-mw-zookeeper.dish.com:2181)] (ClientCnxn.java:839) - Reading > >>>> reply sessionid:0x1437b2879870005, packet:: clientPath:null > >>>> serverPath:null finished:false header:: 619,1 replyHeader:: > >> 619,51539617508,0 request:: > >>>> '/consumers/account-activated-hadoop-consumer/owners/account-activate > >>>> d/68,#6163636f756e742d6163746976617465642d6861646f6f702d636f6e73756d6 > >>>> 5725f746d316d7764706c30342d313338393232323535373930362d35363262363733 > >>>> 382d30,v{s{31,s{'world,'anyone}}},1 > >>>> response:: > >>>> '/consumers/account-activated-hadoop-consumer/owners/account-activate > >>>> d/68 > >>>> {2014-01-10 00:58:09,284} DEBUG > >>>> [account-access-hadoop-consumer_tm1mwdpl04-1389222551916-a0c87abc_wat > >>>> cher_executor] > >>>> (?:?) - > >>>> [account-access-hadoop-consumer_tm1mwdpl04-1389222551916-a0c87abc], > >>>> account-access:27: fetched offset = 9655: consumed offset = 9655 > >>>> selected new offset 9655 > >>>> {2014-01-10 00:58:09,284} INFO > >>>> > >>>> [account-info-updated-hadoop-consumer_tm1mwdpl04-1389222553159-ad5966 > >>>> 0b_watcher_executor] > >>>> (?:?) - > >>>> [account-info-updated-hadoop-consumer_tm1mwdpl04-1389222553159-ad5966 > >>>> 0b-leader-finder-thread], > >>>> Shutdown completed > >>>> {2014-01-10 00:58:09,284} INFO > >>>> > >>>> [account-info-updated-hadoop-consumer_tm1mwdpl04-1389222553159-ad5966 > >>>> 0b-leader-finder-thread] > >>>> (?:?) - > >>>> [account-info-updated-hadoop-consumer_tm1mwdpl04-1389222553159-ad5966 > >>>> 0b-leader-finder-thread], > >>>> Stopped > >>>> {2014-01-10 00:58:09,285} INFO > >>>> > >>>> [account-info-updated-hadoop-consumer_tm1mwdpl04-1389222553159-ad5966 > >>>> 0b_watcher_executor] > >>>> (?:?) - [ConsumerFetcherManager-1389222553163] Stopping all fetchers > >>> > >>> > >>> -- > >>> -- Guozhang > >> > >> > >> > >> -- > >> -- Guozhang > >> > >> > -- -- Guozhang