What's the revision of the 0.8 branch that you used? If that's older than the beta1 release, I recommend that you upgrade.
Thanks, Jun On Tue, Jul 30, 2013 at 3:09 AM, Hargett, Phil < phil.harg...@mirror-image.com> wrote: > No, sorry, it didn't take 90 seconds to connect to ZK (at least I hope > not). I had my consumer open for 90 secs in this case before shutting it > down and disposing of it—hence any races caused by fast startup/shutdown > should not have been relevant. > > I build from source off of the 0.8 branch, so isn't that pretty close to > beta 1? > > :) > > On Jul 30, 2013, at 12:22 AM, "Jun Rao" <jun...@gmail.com<mailto: > jun...@gmail.com>> wrote: > > Hmm, it takes 90 secs to connect to ZK? That seems way too long. Is your > ZK healthy. > > Also, are you on the 0.8 beta1 release? If not, could you try that one? It > may not be related, but we did fix some consumer side deadlock issues there. > > Thanks, > > Jun > > > On Mon, Jul 29, 2013 at 9:02 AM, Hargett, Phil < > phil.harg...@mirror-image.com<mailto:phil.harg...@mirror-image.com>> > wrote: > I think we have 3 different classes in play here: > > * kafka.consumer.ZookeeperConsumerConnector > * kafka.utils.ShutdownableThread > * kafka.consumer.ConsumerFetcherManager > > The actual consumer is the first one, and it does a fair amount of work > *before* stopping the fetcher—which then results in shutting down the > leader thread > > If the initial connectZk method in ZookeeperConsumerConnector takes a long > time (more than 90 seconds in 1 case this morning), then I could see the > fetcher's stopConnections method not getting called, because there isn't a > ConsumerFetcherManager instance yet. > > In other words, we could be shutting down the consumer before it is fully > initialized—but that doesn't seem correct, as the code in > ZookeeperConsumerConnector is synchronous—my application wouldn't have a > reference to a partially initialized consumer. > > Odd. > > :) > > On Jul 29, 2013, at 11:22 AM, "Jun Rao" <jun...@gmail.com<mailto: > jun...@gmail.com><mailto:jun...@gmail.com<mailto:jun...@gmail.com>>> > wrote: > > There seems to be two separate issues. > > 1. Why do you see NullPointerException in the leaderFinder thread? I am > not sure what's causing this. In the normal path, when a consumer connector > is shut down (this is when the pointer is set to null), it first waits for > the leaderFinder thread to shut down. Do you think that you can provide a > test case that reproduces this and attach it to the jira? > > 2. It seems that you have lots of consumer rebalances. This is good to > avoid since it can slow down the consumption. > > Thanks, > > Jun > > > On Mon, Jul 29, 2013 at 6:21 AM, Hargett, Phil < > phil.harg...@mirror-image.com<mailto:phil.harg...@mirror-image.com > ><mailto:phil.harg...@mirror-image.com<mailto: > phil.harg...@mirror-image.com>>> wrote: > Why would a consumer that has been shutdown still be rebalancing? > > Zookeeper session timeout (zookeeper.session.timeout.ms< > http://zookeeper.session.timeout.ms><http://zookeeper.session.timeout.ms>) > is 1000 and sync time (zookeeper.sync.timeout.ms< > http://zookeeper.sync.timeout.ms><http://zookeeper.sync.timeout.ms>) is > 500. > > Also, the timeout for the thread that looks for the leader is left at the > default 200 milliseconds (refresh.leader.backoff.ms< > http://refresh.leader.backoff.ms><http://refresh.leader.backoff.ms>). > That's why we see these messages so often in our logs. > > I can imagine I need to tune some of these settings for load...but the > issue, I think, is that the consumer has been shutdown, so the ZkClient for > the leader finder thread no longer has a connection—and won't. > > :) > > On Jul 28, 2013, at 11:21 PM, "Jun Rao" <jun...@gmail.com<mailto: > jun...@gmail.com><mailto:jun...@gmail.com<mailto:jun...@gmail.com > >><mailto:jun...@gmail.com<mailto:jun...@gmail.com><mailto: > jun...@gmail.com<mailto:jun...@gmail.com>>>> wrote: > > Ok. So, it seems that the issue is there are lots of rebalances in the > consumer. How long did you set the zk session expiration time? A typical > reason for many rebalances is the consumer side GC. If so, you will see > Zookeeper session expirations in the consumer log (grep for Expired). > Occasional rebalances are fine. Too many rebalances can slow down the > consumption and one will need to tune the java GC setting. > > Thanks, > > Jun > > > On Sat, Jul 27, 2013 at 9:33 AM, Hargett, Phil < > phil.harg...@mirror-image.com<mailto:phil.harg...@mirror-image.com > ><mailto:phil.harg...@mirror-image.com<mailto: > phil.harg...@mirror-image.com>><mailto:phil.harg...@mirror-image.com > <mailto:phil.harg...@mirror-image.com><mailto: > phil.harg...@mirror-image.com<mailto:phil.harg...@mirror-image.com>>>> > wrote: > All bugs are relative, aren't they? :) > > Well, since this thread attempts to rebalance every 200 milliseconds, > these messages REALLY fill up a log and fast. > > Because this error results in so much log output, it makes it difficult to > find other actionable error messages in the log. > > Yes, I could suppress messages from that class (we use log4j after all) > but I am uncomfortable 1) hiding a thread leak, 2) hiding other possible > errors from the same class. > > I filed this as KAFKA 989 (IIRC), as I did not see an obvious bug that > covers it. > > This error also happens in less than 1 day of use: most of our systems in > this category are up for 2-3 months before a software upgrade or other > event causes us to cycle the process. > > I'm sure you have uptime and scaling requirements far beyond ours. So I > hope these reasons don't seem too petty. :) > > > On Jul 27, 2013, at 12:24 AM, "Jun Rao" <jun...@gmail.com<mailto: > jun...@gmail.com><mailto:jun...@gmail.com<mailto:jun...@gmail.com > >><mailto:jun...@gmail.com<mailto:jun...@gmail.com><mailto: > jun...@gmail.com<mailto:jun...@gmail.com>>><mailto:jun...@gmail.com > <mailto:jun...@gmail.com><mailto:jun...@gmail.com<mailto:jun...@gmail.com > >><mailto:jun...@gmail.com<mailto:jun...@gmail.com><mailto: > jun...@gmail.com<mailto:jun...@gmail.com>>>>> wrote: > > Other than those exceptions, what issues do you see in your consumer? > > Thanks, > > Jun > > > On Fri, Jul 26, 2013 at 9:24 AM, Hargett, Phil < > phil.harg...@mirror-image.com<mailto:phil.harg...@mirror-image.com > ><mailto:phil.harg...@mirror-image.com<mailto: > phil.harg...@mirror-image.com>><mailto:phil.harg...@mirror-image.com > <mailto:phil.harg...@mirror-image.com><mailto: > phil.harg...@mirror-image.com<mailto:phil.harg...@mirror-image.com > >>><mailto:phil.harg...@mirror-image.com<mailto: > phil.harg...@mirror-image.com><mailto:phil.harg...@mirror-image.com > <mailto:phil.harg...@mirror-image.com>><mailto: > phil.harg...@mirror-image.com<mailto:phil.harg...@mirror-image.com > ><mailto:phil.harg...@mirror-image.com<mailto: > phil.harg...@mirror-image.com>>>>> wrote: > This NOT a harmless race. > > Now my QA teammate is encountering this issue under load. The result of it > is a background thread that is spinning in a loop that always hits a > NullPointerException. > > I have implemented a variety of assurances in my application code to > ensure that the high-level consumer I'm spinning up in Java stays alive for > at least 10 seconds before being asked to shutdown. Yet the issue still > persists. > > Suggestions? > ________________________________________ > From: Jun Rao [jun...@gmail.com<mailto:jun...@gmail.com><mailto: > jun...@gmail.com<mailto:jun...@gmail.com>><mailto:jun...@gmail.com<mailto: > jun...@gmail.com><mailto:jun...@gmail.com<mailto:jun...@gmail.com > >>><mailto:jun...@gmail.com<mailto:jun...@gmail.com><mailto: > jun...@gmail.com<mailto:jun...@gmail.com>><mailto:jun...@gmail.com<mailto: > jun...@gmail.com><mailto:jun...@gmail.com<mailto:jun...@gmail.com>>>>] > Sent: Tuesday, June 25, 2013 11:58 PM > To: users@kafka.apache.org<mailto:users@kafka.apache.org><mailto: > users@kafka.apache.org<mailto:users@kafka.apache.org>><mailto: > users@kafka.apache.org<mailto:users@kafka.apache.org><mailto: > users@kafka.apache.org<mailto:users@kafka.apache.org>>><mailto: > users@kafka.apache.org<mailto:users@kafka.apache.org><mailto: > users@kafka.apache.org<mailto:users@kafka.apache.org>><mailto: > users@kafka.apache.org<mailto:users@kafka.apache.org><mailto: > users@kafka.apache.org<mailto:users@kafka.apache.org>>>> > Subject: Re: 0.8 throwing exception "Failed to find leader" and high-level > consumer fails to make progress > > The exception is likely due to a race condition btw the logic in ZK watcher > and the closing of ZK connection. It's harmless, except for the weird > exception. > > Thanks, > > Jun > > > On Tue, Jun 25, 2013 at 10:07 AM, Hargett, Phil < > phil.harg...@mirror-image.com<mailto:phil.harg...@mirror-image.com > ><mailto:phil.harg...@mirror-image.com<mailto: > phil.harg...@mirror-image.com>><mailto:phil.harg...@mirror-image.com > <mailto:phil.harg...@mirror-image.com><mailto: > phil.harg...@mirror-image.com<mailto:phil.harg...@mirror-image.com > >>><mailto:phil.harg...@mirror-image.com<mailto: > phil.harg...@mirror-image.com><mailto:phil.harg...@mirror-image.com > <mailto:phil.harg...@mirror-image.com>><mailto: > phil.harg...@mirror-image.com<mailto:phil.harg...@mirror-image.com > ><mailto:phil.harg...@mirror-image.com<mailto: > phil.harg...@mirror-image.com>>>>> wrote: > > > Possibly. > > > > I see evidence that its being stopped / started every 30 seconds in same > > cases (due to my code). It's entirely possible that I have a race, too, > in > > that 2 separate pieces of code could be triggering such a stop / start. > > > > Gives me something to track down. Thank you!! > > > > On Jun 25, 2013, at 12:45 PM, "Jun Rao" <jun...@gmail.com<mailto: > jun...@gmail.com><mailto:jun...@gmail.com<mailto:jun...@gmail.com > >><mailto:jun...@gmail.com<mailto:jun...@gmail.com><mailto: > jun...@gmail.com<mailto:jun...@gmail.com>>><mailto:jun...@gmail.com > <mailto:jun...@gmail.com><mailto:jun...@gmail.com<mailto:jun...@gmail.com > >><mailto:jun...@gmail.com<mailto:jun...@gmail.com><mailto: > jun...@gmail.com<mailto:jun...@gmail.com>>>>> wrote: > > > > > This typically only happens when the consumerConnector is shut down. > Are > > > you restarting the consumerConnector often? > > > > > > Thanks, > > > > > > Jun > > > > > > > > > On Tue, Jun 25, 2013 at 9:40 AM, Hargett, Phil < > > > phil.harg...@mirror-image.com<mailto:phil.harg...@mirror-image.com > ><mailto:phil.harg...@mirror-image.com<mailto: > phil.harg...@mirror-image.com>><mailto:phil.harg...@mirror-image.com > <mailto:phil.harg...@mirror-image.com><mailto: > phil.harg...@mirror-image.com<mailto:phil.harg...@mirror-image.com > >>><mailto:phil.harg...@mirror-image.com<mailto: > phil.harg...@mirror-image.com><mailto:phil.harg...@mirror-image.com > <mailto:phil.harg...@mirror-image.com>><mailto: > phil.harg...@mirror-image.com<mailto:phil.harg...@mirror-image.com > ><mailto:phil.harg...@mirror-image.com<mailto: > phil.harg...@mirror-image.com>>>>> wrote: > > > > > >> Seeing this exception a LOT (3-4 times per second, same log topic). > > >> > > >> I'm using external code to feed data to about 50 different log topics > > over > > >> a cluster of 3 Kafka 0.8 brokers. There are 3 ZooKeeper instances as > > well, > > >> all of this is running on EC2. My application creates a high-level > > >> consumer (1 per topic) to consumer data from each and do further > > processing. > > >> > > >> The problem is this exception is in the high-level consumer, so my > code > > >> has no way of knowing that it's become stuck. > > >> > > >> This exception does not always appear, but as far as I can tell, once > > this > > >> happens, the only cure is to restart my application's process. > > >> > > >> I saw this in 0.8 built from source about 1 week ago, and also am > seeing > > >> it today after pulling the latest 0.8 sources and rebuilding Kafka. > > >> > > >> Thoughts? > > >> > > >> Failed to find leader for Set([topic6,0]): > > java.lang.NullPointerException > > >> at org.I0Itec.zkclient.ZkClient$2.call(ZkClient.java:416) > > >> at org.I0Itec.zkclient.ZkClient$2.call(ZkClient.java:413) > > >> at > > >> org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675) > > >> at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:413) > > >> at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:409) > > >> at > > >> kafka.utils.ZkUtils$.getChildrenParentMayNotExist(ZkUtils.scala:438) > > >> at > kafka.utils.ZkUtils$.getAllBrokersInCluster(ZkUtils.scala:75) > > >> at > > >> > > > kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:63) > > >> at > > kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51) > > >> > > > > > > >