Just to expand on Lawrence's answer: file descriptor usage goes from 2-3K
under normal conditions to 64K+ under deadlock, which it hits within a couple
of hours. At that point the broker goes down, because 64K is our OS-defined
limit.
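
In case it's useful, here is a minimal sketch of one way to track the open
descriptor count for the broker process over time. It just counts the entries
under /proc/<pid>/fd, so it assumes Linux, and the broker PID passed on the
command line is an assumption, not something Kafka provides:

import java.io.File;

public class BrokerFdCount {
    public static void main(String[] args) throws InterruptedException {
        // The broker PID is an assumption here - pass it as the first argument.
        // (Reading another process's /proc/<pid>/fd requires the same user or root.)
        String pid = args.length > 0 ? args[0] : "self";
        File fdDir = new File("/proc/" + pid + "/fd");
        while (true) {
            String[] fds = fdDir.list();   // one entry per open file descriptor
            int open = (fds == null) ? -1 : fds.length;
            System.out.println(System.currentTimeMillis() + " open_fds=" + open);
            Thread.sleep(60_000);          // sample once a minute
        }
    }
}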

If it were only the ~33% increase from the new timestamp indexes, we should be
seeing at most 4K-5K file descriptors in use, not 64K+.
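
To sanity-check the timestamp-index theory, here is a rough sketch that counts
segment files by suffix under a broker log directory. The /var/kafka-logs path
is just an example - point it at whatever log.dirs is set to. The extra
.timeindex files are what the ~33% estimate refers to:

import java.io.IOException;
import java.nio.file.*;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SegmentFileCounts {
    public static void main(String[] args) throws IOException {
        // /var/kafka-logs is an example path - use the broker's log.dirs setting.
        Path logDir = Paths.get(args.length > 0 ? args[0] : "/var/kafka-logs");
        try (Stream<Path> files = Files.walk(logDir)) {
            Map<String, Long> counts = files
                    .filter(Files::isRegularFile)
                    .map(p -> p.getFileName().toString())
                    .filter(n -> n.endsWith(".log") || n.endsWith(".index")
                              || n.endsWith(".timeindex"))
                    .collect(Collectors.groupingBy(
                            n -> n.substring(n.lastIndexOf('.')),
                            Collectors.counting()));
            // On 0.10.1 each segment carries a .timeindex file in addition to
            // the existing .log and .index files.
            counts.forEach((suffix, count) -> System.out.println(suffix + ": " + count));
        }
    }
}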

Marcos


On Thu, Nov 3, 2016 at 1:53 PM, Lawrence Weikum <lwei...@pandora.com> wrote:

> We saw this increase when upgrading from 0.9.0.1 to 0.10.0.1.
> We’re now running on 0.10.1.0, and the FD increase is due to a deadlock,
> not functionality or new features.
>
> Lawrence Weikum | Software Engineer | Pandora
> 1426 Pearl Street, Suite 100, Boulder CO 80302
> m 720.203.1578 | lwei...@pandora.com
>
> On 11/3/16, 12:42 PM, "Hans Jespersen" <h...@confluent.io> wrote:
>
>     The 0.10.1 broker will use more file descriptors than previous releases
>     because of the new timestamp indexes. You should expect and plan for
>     ~33% more file descriptors to be open.
>
>     -hans
>
>     /**
>      * Hans Jespersen, Principal Systems Engineer, Confluent Inc.
>      * h...@confluent.io (650)924-2670
>      */
>
>     On Thu, Nov 3, 2016 at 10:02 AM, Marcos Juarez <mjua...@gmail.com> wrote:
>
>     > We're running into a recurrent deadlock issue in both our production
>     > and staging clusters, both using the latest 0.10.1 release.  The
>     > symptom we noticed was that, on servers where Kafka producer
>     > connections are short-lived, every other day or so we'd see file
>     > descriptors being exhausted until either the broker is restarted or it
>     > runs out of file descriptors and goes down.  None of the clients are
>     > on 0.10.1 Kafka jars; they're all using previous versions.
>     >
>     > When diagnosing the issue, we found that when the system is in that
>     > state, using up file descriptors at a very fast rate, the JVM is
>     > actually in a deadlock.  We did thread dumps from both jstack and
>     > VisualVM, and attached those to this email.
>     >
>     > This is the interesting bit from the jstack thread dump:
>     >
>     >
>     > Found one Java-level deadlock:
>     > =============================
>     > "executor-Heartbeat":
>     >   waiting to lock monitor 0x00000000016c8138 (object 0x000000062732a398, a kafka.coordinator.GroupMetadata),
>     >   which is held by "group-metadata-manager-0"
>     >
>     > "group-metadata-manager-0":
>     >   waiting to lock monitor 0x00000000011ddaa8 (object 0x000000063f1b0cc0, a java.util.LinkedList),
>     >   which is held by "kafka-request-handler-3"
>     >
>     > "kafka-request-handler-3":
>     >   waiting to lock monitor 0x00000000016c8138 (object 0x000000062732a398, a kafka.coordinator.GroupMetadata),
>     >   which is held by "group-metadata-manager-0"
>     >
>     >
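For anyone skimming, the cycle above is a plain lock-ordering inversion:
"group-metadata-manager-0" holds the GroupMetadata monitor and waits for the
LinkedList, while "kafka-request-handler-3" holds the LinkedList and waits for
the GroupMetadata monitor. A stripped-down sketch of that pattern follows; the
object and thread names below are stand-ins for illustration, not Kafka's
actual code:

public class LockOrderDeadlock {
    // Stand-ins for the two monitors in the dump: the GroupMetadata object
    // and the LinkedList of pending operations (names are illustrative only).
    private static final Object groupMetadata = new Object();
    private static final Object delayedOpsList = new Object();

    public static void main(String[] args) {
        // Models "group-metadata-manager-0": holds groupMetadata, then wants the list.
        Thread manager = new Thread(() -> {
            synchronized (groupMetadata) {
                sleep(100);                       // widen the race window
                synchronized (delayedOpsList) { } // never reached once deadlocked
            }
        }, "group-metadata-manager-0");

        // Models "kafka-request-handler-3": holds the list, then wants groupMetadata.
        Thread handler = new Thread(() -> {
            synchronized (delayedOpsList) {
                sleep(100);
                synchronized (groupMetadata) { }  // never reached once deadlocked
            }
        }, "kafka-request-handler-3");

        manager.start();
        handler.start();
    }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
    }
}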
>     > I also noticed the background heartbeat thread (I'm guessing the one
>     > called "executor-Heartbeat" above) is new for this release, introduced
>     > under the KAFKA-3888 ticket -
>     > https://issues.apache.org/jira/browse/KAFKA-3888
>     >
>     > We haven't noticed this problem with earlier Kafka broker versions, so
>     > I'm guessing this new background heartbeat thread is what introduced
>     > the deadlock.
>     >
>     > That same broker is still in the deadlocked state; we haven't
>     > restarted it yet, so let me know if you'd like more info/logs/stats
>     > from the system before we restart it.
>     >
>     > Thanks,
>     >
>     > Marcos Juarez
>     >
>
>
>
