Gian Merlino created KAFKA-7697:
-----------------------------------

             Summary: Possible deadlock in kafka.cluster.Partition
                 Key: KAFKA-7697
                 URL: https://issues.apache.org/jira/browse/KAFKA-7697
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 2.1.0
            Reporter: Gian Merlino
         Attachments: threaddump.txt

After upgrading a fairly busy broker from 0.10.2.0 to 2.1.0, it locked up 
within a few minutes (by "locked up" I mean that all request handler threads 
were busy, and other brokers reported that they couldn't communicate with it). 
I restarted it a few times and it did the same thing each time. After 
downgrading to 0.10.2.0, the broker was stable. I attached a thread dump from 
the last attempt on 2.1.0 that shows lots of kafka-request-handler- threads 
trying to acquire the leaderIsrUpdateLock lock in kafka.cluster.Partition.

What jumps out is that two threads, kafka-request-handler-1 and 
kafka-request-handler-4, already hold some read lock (I can't tell which one) 
and are each trying to acquire a second one, on two different read locks: 
0x0000000708184b88 and 0x000000070821f188. Both are handling a produce request 
and, in the process of doing so, are calling Partition.fetchOffsetSnapshot 
while trying to complete a DelayedFetch. At the same time, both of those locks 
have writers from other threads waiting on them (kafka-request-handler-2 and 
kafka-scheduler-6). Neither of those locks appears to have a writer that holds 
it (if only because no threads in the dump are deep enough in inWriteLock to 
indicate that).

ReentrantReadWriteLock in nonfair mode makes a new reader wait whenever a 
writer is at the head of the lock's wait queue, so waiting writers are 
effectively prioritized over new readers. Is it possible that 
kafka-request-handler-1 and kafka-request-handler-4 are each trying to 
read-lock the partition that is currently read-locked by the other one, and 
that both are parked behind kafka-request-handler-2 and kafka-scheduler-6? 
Those two can never get their write locks, because the former two threads own 
read locks and aren't giving them up.
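
To make the suspected interleaving concrete, here is a minimal standalone 
sketch. It is not Kafka code: the class, method, and thread names are 
hypothetical, and lockA and lockB merely stand in for the leaderIsrUpdateLock 
of two partitions. It assumes the OpenJDK implementation of 
ReentrantReadWriteLock, and it uses a timed tryLock for the second read lock so 
that the program terminates instead of hanging the way a plain lock() call 
would.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class CrossReadLockSketch {

    /** Starts a writer and waits until it is parked waiting for the write lock. */
    static Thread parkWriter(ReentrantReadWriteLock lock, String name)
            throws InterruptedException {
        Thread writer = new Thread(() -> {
            lock.writeLock().lock();   // parks while any reader holds the lock
            lock.writeLock().unlock();
        }, name);
        writer.start();
        while (writer.getState() != Thread.State.WAITING) {
            Thread.sleep(5);
        }
        return writer;
    }

    /** Hypothetical "request handler": read-locks `first`, then tries `second`. */
    static Thread handler(String name, ReentrantReadWriteLock first,
                          ReentrantReadWriteLock second, CountDownLatch firstLocksHeld,
                          CountDownLatch writersQueued, CountDownLatch attemptsDone,
                          AtomicInteger acquired) {
        Thread t = new Thread(() -> {
            first.readLock().lock();
            try {
                firstLocksHeld.countDown();
                writersQueued.await();
                // Timed tryLock queues behind the waiting writer and then gives up;
                // a plain lock() here would never return.
                if (second.readLock().tryLock(300, TimeUnit.MILLISECONDS)) {
                    acquired.incrementAndGet();
                    second.readLock().unlock();
                }
                // Keep the first read lock until both handlers have made their attempt.
                attemptsDone.countDown();
                attemptsDone.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                first.readLock().unlock();
            }
        }, name);
        t.start();
        return t;
    }

    /** Returns how many of the two handlers obtained their second read lock. */
    static int secondReadLocksAcquired() {
        ReentrantReadWriteLock lockA = new ReentrantReadWriteLock(); // nonfair by default
        ReentrantReadWriteLock lockB = new ReentrantReadWriteLock();
        CountDownLatch firstLocksHeld = new CountDownLatch(2);
        CountDownLatch writersQueued = new CountDownLatch(1);
        CountDownLatch attemptsDone = new CountDownLatch(2);
        AtomicInteger acquired = new AtomicInteger();
        try {
            Thread h1 = handler("handler-1", lockA, lockB,
                    firstLocksHeld, writersQueued, attemptsDone, acquired);
            Thread h4 = handler("handler-4", lockB, lockA,
                    firstLocksHeld, writersQueued, attemptsDone, acquired);
            firstLocksHeld.await();
            Thread w2 = parkWriter(lockA, "writer-on-A");
            Thread w6 = parkWriter(lockB, "writer-on-B");
            writersQueued.countDown();
            for (Thread t : new Thread[] {h1, h4, w2, w6}) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return acquired.get();
    }

    public static void main(String[] args) {
        System.out.println("second read locks acquired: " + secondReadLocksAcquired());
    }
}
```

If the hypothesis is right, neither handler gets its second read lock even 
though no thread ever holds a write lock: each one queues behind a writer that 
is itself waiting for the other handler's read lock to be released.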



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
