Re: Fatal issue (was RE: 0.8 throwing exception "Failed to find leader" and high-level consumer fails to make progress)

Hargett, Phil Mon, 29 Jul 2013 06:23:34 -0700

Why would a consumer that has been shut down still be rebalancing?

ZooKeeper session timeout (zookeeper.session.timeout.ms) is 1000 and sync time 
(zookeeper.sync.time.ms) is 500.

Also, the timeout for the thread that looks for the leader is left at the 
default 200 milliseconds (refresh.leader.backoff.ms). That's why we see these 
messages so often in our logs.
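
For reference, the settings described above could be collected into consumer properties roughly as follows. This is only a sketch: the class and method names are hypothetical, the ZooKeeper address and group id are placeholders, and it assumes the 0.8 consumer reads the sync time from the key zookeeper.sync.time.ms.

```java
import java.util.Properties;

// Hypothetical helper (not part of Kafka): gathers the consumer settings
// quoted in this message into a java.util.Properties object.
final class ConsumerSettingsSketch {

    static Properties consumerProperties(String zkConnect, String groupId) {
        Properties props = new Properties();
        props.setProperty("zookeeper.connect", zkConnect);   // placeholder address supplied by caller
        props.setProperty("group.id", groupId);              // placeholder group supplied by caller
        props.setProperty("zookeeper.session.timeout.ms", "1000");
        props.setProperty("zookeeper.sync.time.ms", "500");  // assumed key name for the sync time
        props.setProperty("refresh.leader.backoff.ms", "200"); // left at the default mentioned above
        return props;
    }

    public static void main(String[] args) {
        Properties props = consumerProperties("localhost:2181", "example-group");
        // Print the settings in a stable (sorted) order.
        props.stringPropertyNames().stream().sorted()
             .forEach(key -> System.out.println(key + "=" + props.getProperty(key)));
    }
}
```

A 1000 ms session timeout leaves little headroom for GC pauses, so raising it is probably the first setting to try when tuning for load.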

I can imagine I need to tune some of these settings for load, but the issue, I 
think, is that the consumer has been shut down, so the ZkClient for the leader 
finder thread no longer has a connection and never will.

:)

On Jul 28, 2013, at 11:21 PM, "Jun Rao" 
<jun...@gmail.com> wrote:

Ok. So, it seems the issue is that there are lots of rebalances in the 
consumer. What did you set the ZK session timeout to? A typical reason 
for many rebalances is consumer-side GC. If so, you will see ZooKeeper 
session expirations in the consumer log (grep for Expired). Occasional 
rebalances are fine. Too many rebalances can slow down consumption, and one 
will need to tune the Java GC settings.
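
The same check as the grep above can be done in code. A minimal sketch, assuming the consumer log is already available as a list of lines and that expirations show up with the word "Expired"; the class and method names are hypothetical and the sample lines in main are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: counts log lines that mention a session expiration,
// the programmatic equivalent of grepping the consumer log for "Expired".
final class ExpirationCountSketch {

    static long countSessionExpirations(List<String> logLines) {
        return logLines.stream()
                       .filter(line -> line.contains("Expired"))
                       .count();
    }

    public static void main(String[] args) {
        // Made-up sample lines, not taken from a real consumer log.
        List<String> sample = Arrays.asList(
            "INFO zookeeper state changed (Expired)",
            "INFO begin rebalancing consumer",
            "INFO zookeeper state changed (SyncConnected)");
        System.out.println("Session expirations: " + countSessionExpirations(sample));
    }
}
```

A count that grows steadily over a short window points at frequent session expirations rather than the occasional rebalance.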

Thanks,

Jun

On Sat, Jul 27, 2013 at 9:33 AM, Hargett, Phil 
<phil.harg...@mirror-image.com> wrote:
All bugs are relative, aren't they? :)

Well, since this thread attempts to rebalance every 200 milliseconds, these 
messages REALLY fill up a log and fast.

Because this error results in so much log output, it makes it difficult to find 
other actionable error messages in the log.

Yes, I could suppress messages from that class (we use log4j, after all), but I 
am uncomfortable with 1) hiding a thread leak and 2) hiding other possible 
errors from the same class.

I filed this as KAFKA-989 (IIRC), as I did not see an existing bug that covers 
it.

This error also happens in less than 1 day of use: most of our systems in this 
category are up for 2-3 months before a software upgrade or other event causes 
us to cycle the process.

I'm sure you have uptime and scaling requirements far beyond ours. So I hope 
these reasons don't seem too petty. :)

On Jul 27, 2013, at 12:24 AM, "Jun Rao" <jun...@gmail.com> wrote:

Other than those exceptions, what issues do you see in your consumer?

Thanks,

Jun

On Fri, Jul 26, 2013 at 9:24 AM, Hargett, Phil 
<phil.harg...@mirror-image.com> wrote:
This is NOT a harmless race.

Now my QA teammate is encountering this issue under load. The result of it is a 
background thread that is spinning in a loop that always hits a 
NullPointerException.

I have implemented a variety of assurances in my application code to ensure 
that the high-level consumer I'm spinning up in Java stays alive for at least 
10 seconds before being asked to shut down. Yet the issue still persists.
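
To illustrate the kind of guard meant here, a minimal sketch follows. It is not the actual application code: the class name is hypothetical, and a plain Runnable stands in for the call that shuts the consumer connector down. It enforces a minimum lifetime and lets only one caller perform the shutdown.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical guard: delays shutdown until a minimum lifetime has elapsed
// and makes sure the shutdown action runs at most once.
final class GuardedShutdownSketch {
    private final Runnable shutdownAction;   // stand-in for the real consumer shutdown call
    private final long startedNanos = System.nanoTime();
    private final long minLifetimeNanos;
    private boolean shutDown = false;

    GuardedShutdownSketch(Runnable shutdownAction, long minLifetimeMillis) {
        this.shutdownAction = shutdownAction;
        this.minLifetimeNanos = TimeUnit.MILLISECONDS.toNanos(minLifetimeMillis);
    }

    // Returns true if this call performed the shutdown; false if it was
    // already done or the wait was interrupted. The method is synchronized,
    // so concurrent callers are serialized while the first one waits.
    synchronized boolean shutdown() {
        if (shutDown) {
            return false;
        }
        try {
            long remaining;
            // Loop so that a slightly short sleep cannot violate the minimum lifetime.
            while ((remaining = minLifetimeNanos - (System.nanoTime() - startedNanos)) > 0) {
                TimeUnit.NANOSECONDS.sleep(remaining);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
        shutdownAction.run();
        shutDown = true;
        return true;
    }

    public static void main(String[] args) {
        GuardedShutdownSketch guard =
            new GuardedShutdownSketch(() -> System.out.println("consumer shut down"), 100);
        System.out.println("first call performed shutdown: " + guard.shutdown());
        System.out.println("second call performed shutdown: " + guard.shutdown());
    }
}
```

In the application the minimum lifetime would be 10000 ms. As noted above, a guard of this kind did not make the exception go away.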

Suggestions?
________________________________________
From: Jun Rao [jun...@gmail.com]
Sent: Tuesday, June 25, 2013 11:58 PM
To: users@kafka.apache.org
Subject: Re: 0.8 throwing exception "Failed to find leader" and high-level 
consumer fails to make progress

The exception is likely due to a race condition between the logic in the ZK
watcher and the closing of the ZK connection. It's harmless, except for the
weird exception.

Thanks,

Jun

On Tue, Jun 25, 2013 at 10:07 AM, Hargett, Phil
<phil.harg...@mirror-image.com> wrote:

> Possibly.
>
> I see evidence that it's being stopped / started every 30 seconds in some
> cases (due to my code). It's entirely possible that I have a race, too, in
> that 2 separate pieces of code could be triggering such a stop / start.
>
> Gives me something to track down. Thank you!!
>
> On Jun 25, 2013, at 12:45 PM, "Jun Rao" <jun...@gmail.com> wrote:
>
> > This typically only happens when the consumerConnector is shut down. Are
> > you restarting the consumerConnector often?
> >
> > Thanks,
> >
> > Jun
> >
> >
> > On Tue, Jun 25, 2013 at 9:40 AM, Hargett, Phil
> > <phil.harg...@mirror-image.com> wrote:
> >
> >> Seeing this exception a LOT (3-4 times per second, same log topic).
> >>
> >> I'm using external code to feed data to about 50 different log topics over
> >> a cluster of 3 Kafka 0.8 brokers.  There are 3 ZooKeeper instances as well;
> >> all of this is running on EC2.  My application creates a high-level
> >> consumer (1 per topic) to consume data from each and do further processing.
> >>
> >> The problem is this exception is in the high-level consumer, so my code
> >> has no way of knowing that it's become stuck.
> >>
> >> This exception does not always appear, but as far as I can tell, once this
> >> happens, the only cure is to restart my application's process.
> >>
> >> I saw this in 0.8 built from source about 1 week ago, and also am seeing
> >> it today after pulling the latest 0.8 sources and rebuilding Kafka.
> >>
> >> Thoughts?
> >>
> >> Failed to find leader for Set([topic6,0]): java.lang.NullPointerException
> >>        at org.I0Itec.zkclient.ZkClient$2.call(ZkClient.java:416)
> >>        at org.I0Itec.zkclient.ZkClient$2.call(ZkClient.java:413)
> >>        at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
> >>        at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:413)
> >>        at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:409)
> >>        at kafka.utils.ZkUtils$.getChildrenParentMayNotExist(ZkUtils.scala:438)
> >>        at kafka.utils.ZkUtils$.getAllBrokersInCluster(ZkUtils.scala:75)
> >>        at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:63)
> >>        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
> >>
>
