Re: Fatal issue (was RE: 0.8 throwing exception "Failed to find leader" and high-level consumer fails to make progress)

Jun Rao Tue, 30 Jul 2013 09:03:35 -0700

What's the revision of the 0.8 branch that you used? If that's older than
the beta1 release, I recommend that you upgrade.


Thanks,

Jun


On Tue, Jul 30, 2013 at 3:09 AM, Hargett, Phil <
phil.harg...@mirror-image.com> wrote:

> No, sorry, it didn't take 90 seconds to connect to ZK (at least I hope
> not). I had my consumer open for 90 secs in this case before shutting it
> down and disposing of it—hence any races caused by fast startup/shutdown
> should not have been relevant.
>
> I build from source off of the 0.8 branch, so isn't that pretty close to
> beta 1?
>
> :)
>
> On Jul 30, 2013, at 12:22 AM, "Jun Rao" <jun...@gmail.com> wrote:
>
> Hmm, it takes 90 secs to connect to ZK? That seems way too long. Is your
> ZK healthy?
>
> Also, are you on the 0.8 beta1 release? If not, could you try that one? It
> may not be related, but we did fix some consumer side deadlock issues there.
>
> Thanks,
>
> Jun
>
>
> On Mon, Jul 29, 2013 at 9:02 AM, Hargett, Phil <
> phil.harg...@mirror-image.com> wrote:
> I think we have 3 different classes in play here:
>
>  * kafka.consumer.ZookeeperConsumerConnector
>  * kafka.utils.ShutdownableThread
>  * kafka.consumer.ConsumerFetcherManager
>
> The actual consumer is the first one, and it does a fair amount of work
> *before* stopping the fetcher, which then results in shutting down the
> leader thread.
>
> If the initial connectZk method in ZookeeperConsumerConnector takes a long
> time (more than 90 seconds in 1 case this morning), then I could see the
> fetcher's stopConnections method not getting called, because there isn't a
> ConsumerFetcherManager instance yet.
>
> In other words, we could be shutting down the consumer before it is fully
> initialized—but that doesn't seem correct, as the code in
> ZookeeperConsumerConnector is synchronous—my application wouldn't have a
> reference to a partially initialized consumer.
>
> Odd.
>
> :)
>
> On Jul 29, 2013, at 11:22 AM, "Jun Rao" <jun...@gmail.com> wrote:
>
> There seem to be two separate issues.
>
> 1. Why do you see NullPointerException in the leaderFinder thread? I am
> not sure what's causing this. In the normal path, when a consumer connector
> is shut down (this is when the pointer is set to null), it first waits for
> the leaderFinder thread to shut down. Do you think that you can provide a
> test case that reproduces this and attach it to the jira?
>
> 2. It seems that you have lots of consumer rebalances. This is good to
> avoid since it can slow down the consumption.
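
The shutdown ordering described in point 1 (wait for the worker thread to stop, and only then clear the shared reference) can be sketched in plain Java. This is a simplified model under stated assumptions, not the Kafka source: `StoppableWorker`, `ShutdownOrderDemo`, and the string standing in for the ZooKeeper client are hypothetical names chosen for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical worker whose owner must wait for it to stop
// before releasing anything the worker still reads.
class StoppableWorker extends Thread {
    private final AtomicBoolean running = new AtomicBoolean(true);
    private final CountDownLatch stopped = new CountDownLatch(1);
    private final Runnable work;

    StoppableWorker(Runnable work) { this.work = work; }

    @Override
    public void run() {
        try {
            while (running.get()) {
                work.run();
            }
        } finally {
            stopped.countDown(); // signals that no further work will run
        }
    }

    // Asks the loop to exit, then blocks until it has actually exited.
    // Assumes the thread was started; otherwise this would wait forever.
    void shutdown() throws InterruptedException {
        running.set(false);
        stopped.await();
    }
}

class ShutdownOrderDemo {
    // Returns how many times the worker saw a null client.
    static int runDemo() {
        AtomicReference<String> client = new AtomicReference<>("zk-client"); // stand-in value
        AtomicInteger nullSightings = new AtomicInteger();
        StoppableWorker worker = new StoppableWorker(() -> {
            if (client.get() == null) {
                nullSightings.incrementAndGet(); // a real client call would fail here
            }
            Thread.yield();
        });
        worker.start();
        try {
            Thread.sleep(50);
            worker.shutdown();   // step 1: wait for the worker to stop
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        client.set(null);        // step 2: only now release the reference
        return nullSightings.get();
    }

    public static void main(String[] args) {
        System.out.println("null sightings: " + runDemo());
    }
}
```

If the reference were cleared before `shutdown()` returned, the worker could observe null mid-loop, which would resemble the repeating failure reported in this thread.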
>
> Thanks,
>
> Jun
>
>
> On Mon, Jul 29, 2013 at 6:21 AM, Hargett, Phil <
> phil.harg...@mirror-image.com> wrote:
> Why would a consumer that has been shut down still be rebalancing?
>
> Zookeeper session timeout (zookeeper.session.timeout.ms) is 1000 and sync
> time (zookeeper.sync.timeout.ms) is 500.
>
> Also, the timeout for the thread that looks for the leader is left at the
> default 200 milliseconds (refresh.leader.backoff.ms).
> That's why we see these messages so often in our logs.
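
For reference, the settings described above could be collected like this before being handed to the consumer. This is a sketch using only `java.util.Properties`; the class name, the group id, and the ZooKeeper address are placeholders, not values from the thread. The sync setting is written here as `zookeeper.sync.time.ms`, which appears to be the spelling in the 0.8 consumer configuration, whereas the message calls it `zookeeper.sync.timeout.ms`; check the name against your version.

```java
import java.util.Properties;

// Illustrative only: gathers the consumer settings quoted in this thread.
class ConsumerSettings {
    static Properties build() {
        Properties props = new Properties();
        // Placeholder values, not taken from the thread.
        props.setProperty("zookeeper.connect", "zk1.example.com:2181");
        props.setProperty("group.id", "example-group");
        // Values reported in the message above (all in milliseconds).
        props.setProperty("zookeeper.session.timeout.ms", "1000");
        props.setProperty("zookeeper.sync.time.ms", "500");    // the message writes zookeeper.sync.timeout.ms
        props.setProperty("refresh.leader.backoff.ms", "200"); // left at its default, per the message
        return props;
    }

    public static void main(String[] args) {
        build().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

With a 200 ms backoff, a leader lookup that fails on every attempt would log roughly five times per second, which is consistent with the log volume described in this thread.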
>
> I can imagine I need to tune some of these settings for load...but the
> issue, I think, is that the consumer has been shut down, so the ZkClient for
> the leader finder thread no longer has a connection, and won't.
>
> :)
>
> On Jul 28, 2013, at 11:21 PM, "Jun Rao" <jun...@gmail.com> wrote:
>
> Ok. So, it seems that the issue is there are lots of rebalances in the
> consumer. What did you set the ZK session expiration time to? A typical
> reason for many rebalances is the consumer side GC. If so, you will see
> Zookeeper session expirations in the consumer log (grep for Expired).
> Occasional rebalances are fine. Too many rebalances can slow down the
> consumption and one will need to tune the java GC setting.
>
> Thanks,
>
> Jun
>
>
> On Sat, Jul 27, 2013 at 9:33 AM, Hargett, Phil <
> phil.harg...@mirror-image.com> wrote:
> All bugs are relative, aren't they? :)
>
> Well, since this thread attempts to rebalance every 200 milliseconds,
> these messages REALLY fill up a log and fast.
>
> Because this error results in so much log output, it makes it difficult to
> find other actionable error messages in the log.
>
> Yes, I could suppress messages from that class (we use log4j after all),
> but I am uncomfortable 1) hiding a thread leak and 2) hiding other possible
> errors from the same class.
>
> I filed this as KAFKA-989 (IIRC), as I did not see an obvious bug that
> covers it.
>
> This error also happens in less than 1 day of use: most of our systems in
> this category are up for 2-3 months before a software upgrade or other
> event causes us to cycle the process.
>
> I'm sure you have uptime and scaling requirements far beyond ours. So I
> hope these reasons don't seem too petty. :)
>
>
> On Jul 27, 2013, at 12:24 AM, "Jun Rao" <jun...@gmail.com> wrote:
>
> Other than those exceptions, what issues do you see in your consumer?
>
> Thanks,
>
> Jun
>
>
> On Fri, Jul 26, 2013 at 9:24 AM, Hargett, Phil <
> phil.harg...@mirror-image.com> wrote:
> This is NOT a harmless race.
>
> Now my QA teammate is encountering this issue under load. The result of it
> is a background thread that is spinning in a loop that always hits a
> NullPointerException.
>
> I have implemented a variety of assurances in my application code to
> ensure that the high-level consumer I'm spinning up in Java stays alive for
> at least 10 seconds before being asked to shut down. Yet the issue still
> persists.
>
> Suggestions?
> ________________________________________
> From: Jun Rao [jun...@gmail.com]
> Sent: Tuesday, June 25, 2013 11:58 PM
> To: users@kafka.apache.org
> Subject: Re: 0.8 throwing exception "Failed to find leader" and high-level
> consumer fails to make progress
>
> The exception is likely due to a race condition btw the logic in ZK watcher
> and the closing of ZK connection. It's harmless, except for the weird
> exception.
>
> Thanks,
>
> Jun
>
>
> On Tue, Jun 25, 2013 at 10:07 AM, Hargett, Phil <
> phil.harg...@mirror-image.com> wrote:
>
> > Possibly.
> >
> > I see evidence that it's being stopped / started every 30 seconds in some
> > cases (due to my code). It's entirely possible that I have a race, too, in
> > that 2 separate pieces of code could be triggering such a stop / start.
> >
> > Gives me something to track down. Thank you!!
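
One way an application might rule out two code paths triggering overlapping stop / start calls is to funnel every lifecycle change through a single guarded object. The sketch below assumes nothing about the Kafka API: `ManagedConsumer`, `ConsumerLifecycle`, and `CountingConsumer` are hypothetical names, and the interface stands in for whatever the application uses to create and shut down its high-level consumer.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the application's own consumer wrapper.
interface ManagedConsumer {
    void start();
    void stop();
}

// Serialises start/stop so that concurrent callers cannot interleave
// and repeated calls become no-ops.
class ConsumerLifecycle {
    private final ManagedConsumer consumer;
    private boolean started = false;

    ConsumerLifecycle(ManagedConsumer consumer) { this.consumer = consumer; }

    // Returns true only if this call actually started the consumer.
    synchronized boolean start() {
        if (started) return false;
        consumer.start();
        started = true;
        return true;
    }

    // Returns true only if this call actually stopped the consumer.
    synchronized boolean stop() {
        if (!started) return false;
        consumer.stop();
        started = false;
        return true;
    }

    // The whole restart holds the lock, so no other caller can slip
    // a start or stop in between the two steps.
    synchronized void restart() {
        stop();
        start();
    }
}

// Test double that only counts calls.
class CountingConsumer implements ManagedConsumer {
    final AtomicInteger starts = new AtomicInteger();
    final AtomicInteger stops = new AtomicInteger();
    public void start() { starts.incrementAndGet(); }
    public void stop() { stops.incrementAndGet(); }
}

class ConsumerLifecycleDemo {
    public static void main(String[] args) {
        CountingConsumer counting = new CountingConsumer();
        ConsumerLifecycle lifecycle = new ConsumerLifecycle(counting);
        lifecycle.start();
        lifecycle.start(); // ignored: already started
        lifecycle.stop();
        lifecycle.stop();  // ignored: already stopped
        System.out.println(counting.starts.get() + " start(s), " + counting.stops.get() + " stop(s)");
    }
}
```

This only addresses races in application code; it would not by itself fix a race inside the consumer's own shutdown path.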
> >
> > On Jun 25, 2013, at 12:45 PM, "Jun Rao" <jun...@gmail.com> wrote:
> >
> > > This typically only happens when the consumerConnector is shut down. Are
> > > you restarting the consumerConnector often?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > >
> > > On Tue, Jun 25, 2013 at 9:40 AM, Hargett, Phil <
> > > phil.harg...@mirror-image.com> wrote:
> > >
> > >> Seeing this exception a LOT (3-4 times per second, same log topic).
> > >>
> > >> I'm using external code to feed data to about 50 different log topics over
> > >> a cluster of 3 Kafka 0.8 brokers. There are 3 ZooKeeper instances as well;
> > >> all of this is running on EC2. My application creates a high-level
> > >> consumer (1 per topic) to consume data from each and do further processing.
> > >>
> > >> The problem is this exception is in the high-level consumer, so my code
> > >> has no way of knowing that it's become stuck.
> > >>
> > >> This exception does not always appear, but as far as I can tell, once this
> > >> happens, the only cure is to restart my application's process.
> > >>
> > >> I saw this in 0.8 built from source about 1 week ago, and also am seeing
> > >> it today after pulling the latest 0.8 sources and rebuilding Kafka.
> > >>
> > >> Thoughts?
> > >>
> > >> Failed to find leader for Set([topic6,0]): java.lang.NullPointerException
> > >>        at org.I0Itec.zkclient.ZkClient$2.call(ZkClient.java:416)
> > >>        at org.I0Itec.zkclient.ZkClient$2.call(ZkClient.java:413)
> > >>        at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
> > >>        at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:413)
> > >>        at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:409)
> > >>        at kafka.utils.ZkUtils$.getChildrenParentMayNotExist(ZkUtils.scala:438)
> > >>        at kafka.utils.ZkUtils$.getAllBrokersInCluster(ZkUtils.scala:75)
> > >>        at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:63)
> > >>        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
> > >>
> >
>
>
>
>
>
