Just to expand on Lawrence's answer: file descriptor usage goes from 2-3K under normal conditions to 64K+ under deadlock, which it hits within a couple of hours. At that point the broker goes down, because 64K is our OS-defined limit.
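For anyone who wants to watch this themselves, here is a minimal sketch of reading the broker JVM's open/max FD counts with the JDK's UnixOperatingSystemMXBean. It assumes an Oracle/OpenJDK JVM on Linux; the class name FdWatcher and the one-minute polling interval are just illustrative.

import java.lang.management.ManagementFactory;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdWatcher {
    public static void main(String[] args) throws InterruptedException {
        // On Oracle/OpenJDK on Unix, the platform OS MXBean exposes FD counts.
        UnixOperatingSystemMXBean os =
            (UnixOperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        while (true) {
            long open = os.getOpenFileDescriptorCount();
            long max  = os.getMaxFileDescriptorCount();
            System.out.printf("open fds: %d / %d (%.1f%%)%n",
                open, max, 100.0 * open / max);
            Thread.sleep(60_000L);  // poll once a minute
        }
    }
}

The same numbers can also be watched from outside the process with lsof -p <broker pid>, or by counting the entries in /proc/<pid>/fd.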
If it was only a 33% increase from the new timestamp indexes, we should be
going to a maximum of 4K-5K file descriptors in use, not 64K+.

Marcos

On Thu, Nov 3, 2016 at 1:53 PM, Lawrence Weikum <lwei...@pandora.com> wrote:

> We saw this increase when upgrading from 0.9.0.1 to 0.10.0.1.
> We’re now running on 0.10.1.0, and the FD increase is due to a deadlock,
> not functionality or new features.
>
> Lawrence Weikum | Software Engineer | Pandora
> 1426 Pearl Street, Suite 100, Boulder CO 80302
> m 720.203.1578 | lwei...@pandora.com
>
> On 11/3/16, 12:42 PM, "Hans Jespersen" <h...@confluent.io> wrote:
>
> The 0.10.1 broker will use more file descriptors than previous releases
> because of the new timestamp indexes. You should expect and plan for ~33%
> more file descriptors to be open.
>
> -hans
>
> /**
>  * Hans Jespersen, Principal Systems Engineer, Confluent Inc.
>  * h...@confluent.io (650) 924-2670
>  */
>
> On Thu, Nov 3, 2016 at 10:02 AM, Marcos Juarez <mjua...@gmail.com> wrote:
>
> > We're running into a recurrent deadlock issue in both our production and
> > staging clusters, both using the latest 0.10.1 release. The symptom we
> > noticed was that, on servers where Kafka producer connections are
> > short-lived, every other day or so we'd see file descriptors being
> > exhausted until the broker is restarted, or the broker runs out of file
> > descriptors and goes down. None of the clients are on 0.10.1 Kafka jars;
> > they're all using previous versions.
> >
> > When diagnosing the issue, we found that when the system is in that
> > state, using up file descriptors at a really fast rate, the JVM is
> > actually in a deadlock. We took thread dumps from both jstack and
> > VisualVM, and attached those to this email.
> >
> > This is the interesting bit from the jstack thread dump:
> >
> > Found one Java-level deadlock:
> > =============================
> > "executor-Heartbeat":
> >   waiting to lock monitor 0x00000000016c8138 (object 0x000000062732a398,
> >   a kafka.coordinator.GroupMetadata),
> >   which is held by "group-metadata-manager-0"
> >
> > "group-metadata-manager-0":
> >   waiting to lock monitor 0x00000000011ddaa8 (object 0x000000063f1b0cc0,
> >   a java.util.LinkedList),
> >   which is held by "kafka-request-handler-3"
> >
> > "kafka-request-handler-3":
> >   waiting to lock monitor 0x00000000016c8138 (object 0x000000062732a398,
> >   a kafka.coordinator.GroupMetadata),
> >   which is held by "group-metadata-manager-0"
> >
> > I also noticed the background heartbeat thread (I'm guessing the one
> > called "executor-Heartbeat" above) is new for this release, added under
> > the KAFKA-3888 ticket -
> > https://issues.apache.org/jira/browse/KAFKA-3888
> >
> > We haven't noticed this problem with earlier Kafka broker versions, so
> > I'm guessing maybe this new background heartbeat thread is what
> > introduced the deadlock problem.
> >
> > That same broker is still in the deadlock scenario; we haven't restarted
> > it, so let me know if you'd like more info/logs/stats from the system
> > before we restart it.
> >
> > Thanks,
> >
> > Marcos Juarez
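For anyone hitting the same state, here is a minimal sketch of checking for this kind of Java-level deadlock programmatically with the JDK's ThreadMXBean, which reports roughly the same information as the jstack deadlock section quoted above. The class name DeadlockCheck and the output format are just illustrative.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // IDs of threads deadlocked on monitors or ownable synchronizers,
        // or null if there is no deadlock.
        long[] ids = threads.findDeadlockedThreads();
        if (ids == null) {
            System.out.println("No Java-level deadlock found.");
            return;
        }
        // Print each deadlocked thread, the lock it is blocked on, and the
        // thread holding that lock.
        for (ThreadInfo info : threads.getThreadInfo(ids, true, true)) {
            System.out.printf("\"%s\" waiting on %s held by \"%s\"%n",
                    info.getThreadName(), info.getLockName(), info.getLockOwnerName());
        }
    }
}

Run inside or attached to the deadlocked broker JVM, something like this should flag the executor-Heartbeat / group-metadata-manager-0 / kafka-request-handler-3 cycle without needing a full thread dump.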