Re: GC pauses and rebalance failures

David DeMaagd Mon, 14 Apr 2014 13:42:31 -0700

Deliberately varying the retry/backoff parameters on a per-client basis 
is probably an even more complicated work-around than bumping up the session 
timeout.  I've never tried it because it doesn't really address the probable 
root cause (GC causing client stalls, the zookeeper server dropping connections 
because it is timing-sensitive, rebalances triggered by watches firing 
because of disconnections - it's a problem with zookeeper clients that I am very
familiar with).
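
For anyone who wants to try both work-arounds anyway, here is a rough sketch
of what the per-connector settings might look like.  It only builds the
java.util.Properties objects.  The property names (zookeeper.connect,
group.id, zookeeper.session.timeout.ms, rebalance.backoff.ms) are the 0.8
high-level consumer settings; the class name, the helper, the host name and
all of the numbers are made up for illustration and are not tuned values.

```java
import java.util.Properties;

// Hypothetical helper, not part of Kafka: builds one Properties object per
// ConsumerConnector so that connectors sharing a JVM do not retry in lockstep.
final class StaggeredConsumerConfig {
    // Illustrative values only (assumptions, not recommendations).
    static final int BASE_BACKOFF_MS = 2000;
    static final int BACKOFF_STEP_MS = 700;

    static Properties build(String zkConnect, String groupId,
                            int connectorIndex, int sessionTimeoutMs) {
        Properties props = new Properties();
        props.setProperty("zookeeper.connect", zkConnect);
        props.setProperty("group.id", groupId);
        // Work-around for long GC pauses: give the ZK session more headroom.
        props.setProperty("zookeeper.session.timeout.ms",
                Integer.toString(sessionTimeoutMs));
        // Give each connector in the JVM a different backoff between
        // rebalance attempts, offset by its index.
        int backoffMs = BASE_BACKOFF_MS + connectorIndex * BACKOFF_STEP_MS;
        props.setProperty("rebalance.backoff.ms", Integer.toString(backoffMs));
        return props;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            Properties p = build("zk1:2181", "example-group", i, 30000);
            System.out.println("connector " + i + ": rebalance.backoff.ms="
                    + p.getProperty("rebalance.backoff.ms"));
        }
    }
}
```

Each Properties object would then be wrapped in a ConsumerConfig and handed
to its own connector.  This only spreads out the retries; it does nothing
about the GC pauses themselves.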


-- 
Dave DeMaagd | S'aite Reliability Engineering, Y'all
ddema...@linkedin.com | 818 262 7958

(cl...@breyman.com - Mon, Apr 14, 2014 at 01:26:43PM -0700)
> Thanks David. One hypothesis we have is that using different
> rebalance.backoff.ms settings for the different ConsumerConnectors on the
> same JVM will keep them from synchronizing their rebalance attempts enough
> so that one can finish.
> 
> 
> On Mon, Apr 14, 2014 at 12:58 PM, David DeMaagd <ddema...@linkedin.com> wrote:
> 
> > Correct - heavy client GC leads to numerous problems.  There are
> > two things you can do:
> >
> > 1) Tune the client JVM better to get GC to a more reasonable level
> > 2) Increase the zookeeper session timeout value (this is generally a
> >    work-around for #1, but it can buy you time to dig into it)
> >
> > --
> > Dave DeMaagd | S'aite Reliability Engineering, Y'all
> > ddema...@linkedin.com | 818 262 7958
> >
> > (cl...@breyman.com - Mon, Apr 14, 2014 at 12:41:15PM -0700)
> > > I've got some consumers under decent GC pressure and, as a result, they are
> > > having ZK sessions expire and the consumers never recover. I see a number
> > > of rebalance failures in the log after the ZK session expiration followed
> > > by silence (and consumed partitions).
> > >
> > > My hypothesis is that, since the GC pause is global to the JVM, I'll have
> > > multiple ConsumerConnectors get expired at the same time and have
> > > synchronized rebalance/backoff cycles. Since a rebalance fails if new
> > > consumers join mid-rebalance, the multiple expired connectors will always
> > > collide with each other while attempting to rebalance.
> > >
> > > Is this hypothesis crazy? If not, is there a more likely situation? If the
> > > hypothesis isn't crazy, how might I avoid this when the JVM is under GC
> > > pressure?
> > >
> > > Thanks in advance.
> >
