Thanks, Becket. I was working on that today. I have a working jar, built from the 0.10.1.0 branch with that specific KAFKA-3994 patch applied to it. I've left it running on one test broker today, and tomorrow I'll try to trigger the issue against both the patched and un-patched versions.
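Roughly, the kind of producer churn I'm planning to use to try to trigger it is sketched below; the broker address, topic name, and iteration count are just placeholders, not our actual test setup. The idea follows the suspicion further down this thread that short-lived producers with frequent connects/disconnects are what set it off, and acks=all matches the case Becket mentioned where the patch showed a performance impact.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

/*
 * Sketch of a short-lived producer churn loop; broker address, topic,
 * and iteration count are placeholders, not the actual test setup.
 */
public class ShortLivedProducerChurn {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "test-broker:9092");   // placeholder broker
        props.put("acks", "all");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        // Each iteration mimics one short-lived producer process:
        // connect, send a single record, disconnect.
        for (int i = 0; i < 10000; i++) {
            Producer<String, String> producer = new KafkaProducer<>(props);
            try {
                producer.send(new ProducerRecord<>("fd-leak-test", "key-" + i, "value-" + i)).get();
            } finally {
                producer.close();
            }
        }
    }
}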
I'll let you know what we find.

Thanks,

Marcos

On Mon, Nov 7, 2016 at 11:25 AM, Becket Qin <becket....@gmail.com> wrote:

> Hi Marcos,
>
> Is it possible for you to apply the patch of KAFKA-3994 and see if the issue is still there? The current patch of KAFKA-3994 should work; the only reason we haven't checked it in is that when we ran a stress test it showed a noticeable performance impact when producers are producing with acks=all. So if you are blocked on this issue, maybe you can pick up the patch as a short-term solution. Meanwhile we will prioritize the ticket.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Mon, Nov 7, 2016 at 9:47 AM, Marcos Juarez <mjua...@gmail.com> wrote:
>
> > We ran into this issue several more times over the weekend. Basically, FDs are exhausted so fast now that we can't even get to the server in time; the JVM goes down in less than 5 minutes.
> >
> > I can send the whole thread dumps if needed, but for brevity's sake, I just copied over the relevant deadlock segments and concatenated them all together in the attached text file.
> >
> > Do you think this is something I should add to the KAFKA-3994 ticket? Or is the information in that ticket enough for now?
> >
> > Thanks,
> >
> > Marcos
> >
> > On Fri, Nov 4, 2016 at 2:05 PM, Marcos Juarez <mjua...@gmail.com> wrote:
> >
> >> That's great, thanks Jason.
> >>
> >> We'll try to apply the patch in the meantime, and wait for the official 0.10.1.1 release.
> >>
> >> Please let us know if you need more details about the deadlocks on our side.
> >>
> >> Thanks again!
> >>
> >> Marcos
> >>
> >> On Fri, Nov 4, 2016 at 1:02 PM, Jason Gustafson <ja...@confluent.io> wrote:
> >>
> >>> Hi Marcos,
> >>>
> >>> I think we'll try to get this into 0.10.1.1 (I updated the JIRA). Since we're now seeing users hit this in practice, I'll definitely bump up the priority on a fix. I can't say for sure when the release will be, but we'll merge the fix into the 0.10.1 branch and you can build from there if you need something in a hurry.
> >>>
> >>> Thanks,
> >>> Jason
> >>>
> >>> On Fri, Nov 4, 2016 at 9:48 AM, Marcos Juarez <mjua...@gmail.com> wrote:
> >>>
> >>> > Jason,
> >>> >
> >>> > Thanks for that link. It does appear to be a very similar issue, if not identical. In our case, the deadlock is reported as spanning 3 threads, one of them being a group_metadata_manager thread. Otherwise, it looks the same.
> >>> >
> >>> > On your questions:
> >>> >
> >>> > - We did not see this in prior releases, but we have been ramping up usage of our kafka clusters lately, so maybe we didn't have the volume needed to trigger it before.
> >>> >
> >>> > - Across our multiple staging and production clusters, we're seeing the problem roughly once or twice a day.
> >>> >
> >>> > - Our clusters are small at the moment. The two that are experiencing the issue are 5 and 8 brokers, respectively. The number of consumers is small, I'd say less than 20 at the moment. The amount of data being produced is also small, in the tens of megabytes per second range, but the number of connects/disconnects is really high, because the producers are usually short-lived processes. Our guess at the moment is that this is what triggers the bug. We have a separate cluster where we don't have short-lived producers, and that one has been rock solid.
> >>> >
> >>> > We'll look into applying the patch, which could help reduce the problem. The ticket says it's being targeted for the 0.10.2 release. Any rough estimate of a timeline for that to come out?
> >>> >
> >>> > Thanks!
> >>> >
> >>> > Marcos
> >>> >
> >>> > On Thu, Nov 3, 2016 at 5:19 PM, Jason Gustafson <ja...@confluent.io> wrote:
> >>> >
> >>> > > Hey Marcos,
> >>> > >
> >>> > > Thanks for the report. Can you check out https://issues.apache.org/jira/browse/KAFKA-3994 and see if it matches? At a glance, it looks like the same problem. We tried pretty hard to get the fix into the release, but it didn't quite make it. A few questions:
> >>> > >
> >>> > > 1. Did you not see this in prior releases? As far as we can tell, it is possible going back to 0.9.0.0, but there could be a subtle difference in 0.10.1.0 which increases its likelihood.
> >>> > > 2. How often are you hitting this problem? You might try the patch that's available if it's occurring frequently and you can't downgrade. I think the existing patch is incomplete, but it should still reduce the likelihood.
> >>> > > 3. Out of curiosity, what is the size of your cluster and how many consumers do you have in your cluster?
> >>> > >
> >>> > > Thanks!
> >>> > > Jason
> >>> > >
> >>> > > On Thu, Nov 3, 2016 at 1:32 PM, Marcos Juarez <mjua...@gmail.com> wrote:
> >>> > >
> >>> > > > Just to expand on Lawrence's answer: the increase in file descriptor usage goes from 2-3K under normal conditions to 64K+ under deadlock, which it hits within a couple of hours, at which point the broker goes down, because that's our OS-defined limit.
> >>> > > >
> >>> > > > If it were only a 33% increase from the new timestamp indexes, we should be going to at most 4K-5K file descriptors in use, not 64K+.
> >>> > > >
> >>> > > > Marcos
> >>> > > >
> >>> > > > On Thu, Nov 3, 2016 at 1:53 PM, Lawrence Weikum <lwei...@pandora.com> wrote:
> >>> > > >
> >>> > > > > We saw this increase when upgrading from 0.9.0.1 to 0.10.0.1. We’re now running on 0.10.1.0, and the FD increase is due to a deadlock, not functionality or new features.
> >>> > > > >
> >>> > > > > Lawrence Weikum | Software Engineer | Pandora
> >>> > > > > 1426 Pearl Street, Suite 100, Boulder CO 80302
> >>> > > > > m 720.203.1578 | lwei...@pandora.com
> >>> > > > >
> >>> > > > > On 11/3/16, 12:42 PM, "Hans Jespersen" <h...@confluent.io> wrote:
> >>> > > > >
> >>> > > > > The 0.10.1 broker will use more file descriptors than previous releases because of the new timestamp indexes. You should expect and plan for ~33% more file descriptors to be open.
> >>> > > > >
> >>> > > > > -hans
> >>> > > > >
> >>> > > > > /**
> >>> > > > >  * Hans Jespersen, Principal Systems Engineer, Confluent Inc.
> >>> > > > >  * h...@confluent.io (650)924-2670
> >>> > > > >  */
> >>> > > > >
> >>> > > > > On Thu, Nov 3, 2016 at 10:02 AM, Marcos Juarez <mjua...@gmail.com> wrote:
> >>> > > > >
> >>> > > > > > We're running into a recurrent deadlock issue in both our production and staging clusters, both using the latest 0.10.1 release. The symptom we noticed was that, on servers where kafka producer connections are short-lived, every other day or so we'd see file descriptors being exhausted, until the broker is restarted, or the broker runs out of file descriptors and goes down. None of the clients are on 0.10.1 kafka jars; they're all using previous versions.
> >>> > > > > >
> >>> > > > > > When diagnosing the issue, we found that when the system is in that state, using up file descriptors at a really fast rate, the JVM is actually in a deadlock. I did a thread dump from both jstack and visualvm, and attached those to this email.
> >>> > > > > >
> >>> > > > > > This is the interesting bit from the jstack thread dump:
> >>> > > > > >
> >>> > > > > > Found one Java-level deadlock:
> >>> > > > > > =============================
> >>> > > > > > "executor-Heartbeat":
> >>> > > > > >   waiting to lock monitor 0x00000000016c8138 (object 0x000000062732a398, a kafka.coordinator.GroupMetadata),
> >>> > > > > >   which is held by "group-metadata-manager-0"
> >>> > > > > >
> >>> > > > > > "group-metadata-manager-0":
> >>> > > > > >   waiting to lock monitor 0x00000000011ddaa8 (object 0x000000063f1b0cc0, a java.util.LinkedList),
> >>> > > > > >   which is held by "kafka-request-handler-3"
> >>> > > > > >
> >>> > > > > > "kafka-request-handler-3":
> >>> > > > > >   waiting to lock monitor 0x00000000016c8138 (object 0x000000062732a398, a kafka.coordinator.GroupMetadata),
> >>> > > > > >   which is held by "group-metadata-manager-0"
> >>> > > > > >
> >>> > > > > > I also noticed the background heartbeat thread (I'm guessing the one called "executor-Heartbeat" above) is new for this release, under the KAFKA-3888 ticket - https://issues.apache.org/jira/browse/KAFKA-3888
> >>> > > > > >
> >>> > > > > > We haven't noticed this problem with earlier Kafka broker versions, so I'm guessing maybe this new background heartbeat thread is what introduced the deadlock problem.
> >>> > > > > >
> >>> > > > > > That same broker is still in the deadlock scenario; we haven't restarted it, so let me know if you'd like more info/logs/stats from the system before we restart it.
> >>> > > > > >
> >>> > > > > > Thanks,
> >>> > > > > >
> >>> > > > > > Marcos Juarez
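In case it's useful to anyone else following along, the check below is roughly how we plan to watch the test broker for the deadlock and the FD climb over JMX while it runs. The host and port are placeholders (the broker has to be started with JMX enabled, e.g. JMX_PORT=9999), and this is just a sketch built on the standard ThreadMXBean / UnixOperatingSystemMXBean APIs, not our actual monitoring code.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

import com.sun.management.UnixOperatingSystemMXBean;

/*
 * Connects to a broker's JMX port and reports (a) any Java-level deadlock,
 * i.e. the same cycle jstack prints, and (b) open vs. max file descriptors.
 * "broker-host" and port 9999 are placeholders; the broker must be started
 * with JMX enabled (e.g. JMX_PORT=9999), and the OS bean lookup assumes a
 * HotSpot JVM on Linux/Unix.
 */
public class BrokerDeadlockCheck {

    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();

            // Same information as jstack's "Found one Java-level deadlock" section.
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            long[] deadlocked = threads.findDeadlockedThreads();
            if (deadlocked == null) {
                System.out.println("No Java-level deadlock detected");
            } else {
                for (ThreadInfo info : threads.getThreadInfo(deadlocked)) {
                    System.out.printf("%s is waiting on %s held by %s%n",
                            info.getThreadName(), info.getLockName(), info.getLockOwnerName());
                }
            }

            // FD usage; in our case this climbs from ~2-3K to the 64K OS limit under deadlock.
            UnixOperatingSystemMXBean os = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.OPERATING_SYSTEM_MXBEAN_NAME,
                    UnixOperatingSystemMXBean.class);
            System.out.printf("Open FDs: %d of max %d%n",
                    os.getOpenFileDescriptorCount(), os.getMaxFileDescriptorCount());
        } finally {
            connector.close();
        }
    }
}

When the broker is wedged, the deadlock output should list the same cycle as the jstack segment quoted above (executor-Heartbeat, group-metadata-manager-0, kafka-request-handler-3).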