We're running into a recurring deadlock issue in both our production and staging clusters, both running the latest 0.10.1 release. The symptom we noticed is that on brokers where Kafka producer connections are short-lived, every other day or so file descriptors start being consumed until either we restart the broker or it runs out of file descriptors and goes down. None of the clients are on 0.10.1 Kafka jars; they're all using earlier versions.
While diagnosing the issue, we found that when the system is in that state, burning through file descriptors at a very fast rate, the JVM is actually in a deadlock. We took thread dumps with both jstack and VisualVM and attached them to this email. This is the interesting bit from the jstack thread dump:

Found one Java-level deadlock:
=============================
"executor-Heartbeat":
  waiting to lock monitor 0x00000000016c8138 (object 0x000000062732a398, a kafka.coordinator.GroupMetadata),
  which is held by "group-metadata-manager-0"
"group-metadata-manager-0":
  waiting to lock monitor 0x00000000011ddaa8 (object 0x000000063f1b0cc0, a java.util.LinkedList),
  which is held by "kafka-request-handler-3"
"kafka-request-handler-3":
  waiting to lock monitor 0x00000000016c8138 (object 0x000000062732a398, a kafka.coordinator.GroupMetadata),
  which is held by "group-metadata-manager-0"

I also noticed that the background heartbeat thread (I'm guessing the one called "executor-Heartbeat" above) is new in this release, introduced under ticket KAFKA-3888 - https://issues.apache.org/jira/browse/KAFKA-3888

We haven't seen this problem with earlier Kafka broker versions, so I suspect this new background heartbeat thread is what introduced the deadlock.

That same broker is still in the deadlocked state; we haven't restarted it, so let me know if you'd like more info/logs/stats from the system before we restart it.

Thanks,

Marcos Juarez
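P.S. In case it helps anyone reason about the lock-ordering cycle outside the broker, here is a minimal, self-contained Java sketch. To be clear, this is not the broker's code: the groupMetadata object and the LinkedList below are just placeholders standing in for the two monitors in the jstack output, and the lock acquisition order is assumed. Running it and taking a jstack dump produces the same kind of "Found one Java-level deadlock" report with the same wait-for cycle.

import java.util.LinkedList;

// Illustrative only: two threads acquiring the same two monitors in
// opposite order, mirroring the cycle between "group-metadata-manager-0"
// and "kafka-request-handler-3" in the attached dump.
public class DeadlockSketch {
    // Stand-in for the kafka.coordinator.GroupMetadata monitor
    static final Object groupMetadata = new Object();
    // Stand-in for the java.util.LinkedList monitor
    static final LinkedList<Object> delayedQueue = new LinkedList<>();

    public static void main(String[] args) {
        // Takes the group lock first, then the list lock
        Thread manager = new Thread(() -> {
            synchronized (groupMetadata) {
                pause(100);
                synchronized (delayedQueue) {
                    // never reached once the handler thread owns delayedQueue
                }
            }
        }, "group-metadata-manager-0");

        // Takes the list lock first, then the group lock
        Thread handler = new Thread(() -> {
            synchronized (delayedQueue) {
                pause(100);
                synchronized (groupMetadata) {
                    // never reached once the manager thread owns groupMetadata
                }
            }
        }, "kafka-request-handler-3");

        manager.start();
        handler.start();
        // After ~100ms both threads block forever; `jstack <pid>` then reports
        // a Java-level deadlock between the two named threads.
    }

    private static void pause(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
    }
}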