Hi Marcos,

I think we'll try to get this into 0.10.1.1 (I updated the JIRA). Since we're now seeing users hit this in practice, I'll definitely bump up the priority on a fix. I can't say for sure when the release will be, but we'll merge the fix into the 0.10.1 branch, and you can build from there if you need something in a hurry.
Thanks,
Jason

On Fri, Nov 4, 2016 at 9:48 AM, Marcos Juarez <mjua...@gmail.com> wrote:

> Jason,
>
> Thanks for that link. It does appear to be a very similar issue, if not identical. In our case, the deadlock is reported across three threads, one of them being a group_metadata_manager thread. Otherwise, it looks the same.
>
> On your questions:
>
> - We did not see this in prior releases, but we have been ramping up usage of our Kafka clusters lately, so maybe we didn't have the volume needed to trigger it before.
>
> - Across our multiple staging and production clusters, we're seeing the problem roughly once or twice a day.
>
> - Our clusters are small at the moment. The two experiencing the issue have 5 and 8 brokers, respectively. The number of consumers is small, I'd say fewer than 20 at the moment. The amount of data being produced is also small, in the tens of megabytes per second, but the number of connects/disconnects is really high, because the producers are usually short-lived processes. Our guess at the moment is that this is what triggers the bug. We have a separate cluster without short-lived producers, and that one has been rock solid.
>
> We'll look into applying the patch, which could help reduce the problem. The ticket says it's being targeted for the 0.10.2 release. Any rough estimate of a timeline for that to come out?
>
> Thanks!
>
> Marcos
>
> On Thu, Nov 3, 2016 at 5:19 PM, Jason Gustafson <ja...@confluent.io> wrote:
>
> > Hey Marcos,
> >
> > Thanks for the report. Can you check out https://issues.apache.org/jira/browse/KAFKA-3994 and see if it matches? At a glance, it looks like the same problem. We tried pretty hard to get the fix into the release, but it didn't quite make it. A few questions:
> >
> > 1. Did you not see this in prior releases? As far as we can tell, it is possible going back to 0.9.0.0, but there could be a subtle difference in 0.10.1.0 which increases its likelihood.
> > 2. How often are you hitting this problem? You might try the patch that's available if it's occurring frequently and you can't downgrade. I think the existing patch is incomplete, but it should still reduce the likelihood.
> > 3. Out of curiosity, what is the size of your cluster, and how many consumers do you have?
> >
> > Thanks!
> > Jason
> >
> > On Thu, Nov 3, 2016 at 1:32 PM, Marcos Juarez <mjua...@gmail.com> wrote:
> >
> > > Just to expand on Lawrence's answer: the increase in file descriptor usage goes from 2-3K under normal conditions to 64K+ under deadlock, which it hits within a couple of hours, at which point the broker goes down, because that's our OS-defined limit.
> > >
> > > If it were only a 33% increase from the new timestamp indexes, we should be going to at most 4-5K file descriptors in use, not 64K+.
> > >
> > > Marcos
> > >
> > > On Thu, Nov 3, 2016 at 1:53 PM, Lawrence Weikum <lwei...@pandora.com> wrote:
> > >
> > > > We saw this increase when upgrading from 0.9.0.1 to 0.10.0.1. We're now running on 0.10.1.0, and the FD increase is due to a deadlock, not functionality or new features.
> > > >
> > > > Lawrence Weikum | Software Engineer | Pandora
> > > > 1426 Pearl Street, Suite 100, Boulder CO 80302
> > > > m 720.203.1578 | lwei...@pandora.com
> > > >
> > > > On 11/3/16, 12:42 PM, "Hans Jespersen" <h...@confluent.io> wrote:
> > > >
> > > > The 0.10.1 broker will use more file descriptors than previous releases because of the new timestamp indexes. You should expect and plan for ~33% more file descriptors to be open.
> > > >
> > > > -hans
> > > >
> > > > /**
> > > >  * Hans Jespersen, Principal Systems Engineer, Confluent Inc.
> > > >  * h...@confluent.io (650)924-2670
> > > >  */
> > > >
> > > > On Thu, Nov 3, 2016 at 10:02 AM, Marcos Juarez <mjua...@gmail.com> wrote:
> > > >
> > > > > We're running into a recurrent deadlock issue in both our production and staging clusters, both using the latest 0.10.1 release. The symptom we noticed was that, on servers where Kafka producer connections are short-lived, every other day or so we'd see file descriptors being exhausted until the broker is restarted, or until the broker runs out of file descriptors and goes down. None of the clients are on 0.10.1 Kafka jars; they're all using previous versions.
> > > > >
> > > > > When diagnosing the issue, we found that when the system is in that state, using up file descriptors at a really fast rate, the JVM is actually in a deadlock. I did a thread dump from both jstack and VisualVM, and attached those to this email.
> > > > >
> > > > > This is the interesting bit from the jstack thread dump:
> > > > >
> > > > > Found one Java-level deadlock:
> > > > > =============================
> > > > > "executor-Heartbeat":
> > > > >   waiting to lock monitor 0x00000000016c8138 (object 0x000000062732a398, a kafka.coordinator.GroupMetadata),
> > > > >   which is held by "group-metadata-manager-0"
> > > > >
> > > > > "group-metadata-manager-0":
> > > > >   waiting to lock monitor 0x00000000011ddaa8 (object 0x000000063f1b0cc0, a java.util.LinkedList),
> > > > >   which is held by "kafka-request-handler-3"
> > > > >
> > > > > "kafka-request-handler-3":
> > > > >   waiting to lock monitor 0x00000000016c8138 (object 0x000000062732a398, a kafka.coordinator.GroupMetadata),
> > > > >   which is held by "group-metadata-manager-0"
> > > > >
> > > > > I also noticed that the background heartbeat thread (I'm guessing the one called "executor-Heartbeat" above) is new for this release, under the KAFKA-3888 ticket - https://issues.apache.org/jira/browse/KAFKA-3888
> > > > >
> > > > > We haven't noticed this problem with earlier Kafka broker versions, so I'm guessing maybe this new background heartbeat thread is what introduced the deadlock problem.
> > > > >
> > > > > That same broker is still in the deadlock scenario; we haven't restarted it, so let me know if you'd like more info/logs/stats from the system before we restart it.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Marcos Juarez
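
[Editor's note] The jstack output in the thread above is the classic two-monitor lock-ordering cycle: "group-metadata-manager-0" holds the GroupMetadata monitor and wants the LinkedList monitor, while "kafka-request-handler-3" holds the LinkedList monitor and wants the GroupMetadata one; "executor-Heartbeat" is simply blocked behind the same GroupMetadata lock. The sketch below is NOT Kafka's coordinator code -- the class, lock objects, and timings are invented for illustration -- but it reproduces the same pattern and shows how the JVM's ThreadMXBean can report it programmatically, which is the same Java-level deadlock check jstack performs.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class LockOrderDeadlockSketch {

    // Stand-ins for the two monitors named in the jstack output above:
    // the kafka.coordinator.GroupMetadata object and the java.util.LinkedList.
    private static final Object groupMetadataLock = new Object();
    private static final Object linkedListLock = new Object();

    public static void main(String[] args) throws Exception {
        // Models "group-metadata-manager-0": takes the group lock first,
        // then tries to take the list lock.
        Thread manager = newThread("group-metadata-manager-0", groupMetadataLock, linkedListLock);

        // Models "kafka-request-handler-3": takes the list lock first,
        // then tries to take the group lock -- the opposite order.
        Thread handler = newThread("kafka-request-handler-3", linkedListLock, groupMetadataLock);

        manager.start();
        handler.start();
        Thread.sleep(500); // give both threads time to grab their first lock and block on the second

        // In-process equivalent of jstack's "Found one Java-level deadlock" check, via JMX.
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] deadlocked = mx.findDeadlockedThreads();
        if (deadlocked != null) {
            for (ThreadInfo info : mx.getThreadInfo(deadlocked)) {
                System.out.printf("%s is waiting on %s, which is held by %s%n",
                        info.getThreadName(), info.getLockName(), info.getLockOwnerName());
            }
        }
        // Both worker threads are daemons, so the JVM still exits even though they never unblock.
    }

    private static Thread newThread(String name, Object first, Object second) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                try { Thread.sleep(100); } catch (InterruptedException ignored) { }
                synchronized (second) { /* never reached once both threads hold their first lock */ }
            }
        }, name);
        t.setDaemon(true);
        return t;
    }
}

Running it prints one line per deadlocked thread, naming the monitor it is blocked on and the owner, mirroring the "waiting to lock monitor ... which is held by ..." lines in the report. The usual remedy for this pattern is to have every code path acquire the two locks in a single consistent order, or to avoid holding one lock while taking the other; whether that is exactly how KAFKA-3994 resolves it is best checked against the ticket itself.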