[ https://issues.apache.org/jira/browse/KAFKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174178#comment-14174178 ]
Ewen Cheslack-Postava commented on KAFKA-1710:
----------------------------------------------

This looks like a red herring due to the structure of the test. The test code generates 200 threads which share 4 producers; each thread round-robins through the producers, then sleeps for 10ms. It looks like all that's happening is that the profiling tool sees the same stack trace repeatedly because there's a huge amount of contention for the 4 producers. If you take a look at the stack traces, they're almost all waiting on the lock of the queue that messages get appended to. The few active threads hold those queue locks and are working on compressing data before sending it out.

Given the number of threads and the small number of producers, it's not surprising that YourKit sees the same stack traces for a long time -- the threads can be making forward progress, but any time the profiler stops to look at the stack traces, it's very likely that any given thread will be waiting on a lock with the same stack trace.

None of the stack traces show any evidence of a real deadlock (i.e. I can't find any set of locks where there could be ordering issues, since almost every thread is just waiting on one lock in one of the producers). If this did hit deadlock, the process should stop entirely, because all the worker threads use all 4 producers and the supposedly deadlocked threads are all waiting on locks in the producers. I ran the test to completion multiple times without any issues. Unless this has actually been observed to hit deadlock and stop making progress, I think this should be closed, since these messages are really just warnings from YourKit.

[~Bmis13] you might try reducing the # of threads and seeing if those charts end up looking better. I bet if you actually showed all the threads instead of just the couple in the screenshot, the areas marked as runnable across all threads would sum to a reasonable total.

Also, there are other possible issues with getting good performance from this test code. For example, the round-robin approach can cause all threads to get blocked on the same producer if that producer stays locked for a relatively long time, which can happen when data is ready to be sent and is getting compressed. Other approaches to distributing work across the producers may provide better throughput; see the sketches below.
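To make the contention pattern concrete, here is a minimal sketch of the test structure described above -- 200 threads sharing 4 producers round-robin, all writing to one partition. (The attached TestNetworkDownProducer.java is the authoritative version; the class name, topic, and config values below are illustrative, and the config keys follow the current new-producer API rather than whatever build the test actually ran against.)

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Properties;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class SharedProducerContention {
        public static void main(String[] args) throws InterruptedException {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // illustrative broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("compression.type", "gzip"); // compression work happens while the batch queue is locked

            // 4 producers shared by 200 worker threads, as in the test.
            List<KafkaProducer<byte[], byte[]>> producers = new ArrayList<>();
            for (int i = 0; i < 4; i++) {
                producers.add(new KafkaProducer<>(props));
            }

            ExecutorService pool = Executors.newFixedThreadPool(200);
            for (int t = 0; t < 200; t++) {
                pool.submit(() -> {
                    byte[] payload = new byte[1024];
                    int next = 0;
                    for (int i = 0; i < 10000; i++) {
                        // Round-robin across the shared producers. Every record for the
                        // same partition funnels into one RecordAccumulator queue, and
                        // append() synchronizes on that queue -- the monitor the "frozen"
                        // threads in the dumps below are waiting on.
                        KafkaProducer<byte[], byte[]> p = producers.get(next++ % producers.size());
                        p.send(new ProducerRecord<>("test-topic", 0, null, payload));
                        try {
                            Thread.sleep(10);
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                            return;
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            for (KafkaProducer<byte[], byte[]> p : producers) {
                p.close();
            }
        }
    }

At any sampling instant the vast majority of the 200 threads are parked on one of those 4 monitors, so a sampling profiler keeps seeing identical stacks -- which is all the "frozen for at least 2m" report amounts to.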
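On the last point about distributing work differently: one option, sketched here as a drop-in replacement for the submit loop above (again hypothetical, not the attached test code), is to pin each worker thread to a single producer, so a producer that is busy compressing only stalls its own share of the threads instead of eventually collecting all 200 behind it:

    // Variant of the submit loop: fixed thread-to-producer assignment
    // instead of round-robin.
    for (int t = 0; t < 200; t++) {
        final KafkaProducer<byte[], byte[]> pinned = producers.get(t % producers.size());
        pool.submit(() -> {
            byte[] payload = new byte[1024];
            for (int i = 0; i < 10000; i++) {
                // All of this thread's sends go through one producer; a long
                // compression pass in another producer no longer blocks it.
                pinned.send(new ProducerRecord<>("test-topic", 0, null, payload));
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });
    }

Since everything still goes to a single partition, each producer keeps one hot queue and the contention doesn't disappear; this just avoids the convoy effect where every thread lines up behind whichever producer happens to be locked.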
> [New Java Producer Potential Deadlock] Producer Deadlock when all messages is
> being sent to single partition
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1710
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1710
>             Project: Kafka
>          Issue Type: Bug
>          Components: producer
>         Environment: Development
>            Reporter: Bhavesh Mistry
>            Priority: Critical
>              Labels: performance
>         Attachments: Screen Shot 2014-10-13 at 10.19.04 AM.png,
>                      Screen Shot 2014-10-15 at 9.09.06 PM.png,
>                      Screen Shot 2014-10-15 at 9.14.15 PM.png,
>                      TestNetworkDownProducer.java, th1.dump, th10.dump,
>                      th11.dump, th12.dump, th13.dump, th14.dump, th15.dump,
>                      th2.dump, th3.dump, th4.dump, th5.dump, th6.dump,
>                      th7.dump, th8.dump, th9.dump
>
>
> Hi Kafka Dev Team,
> When I run the test that sends messages to a single partition for 3 minutes
> or so, I encounter a deadlock (please see the attached screenshots) and
> thread contention in YourKit profiling.
> Use Case:
> 1) Aggregating messages into the same partition for metric counting.
> 2) Replicating the old producer's behavior of sticking to a partition for
> 3 minutes.
> Here is the output:
>
> Frozen threads found (potential deadlock)
>
> It seems that the following threads have not changed their stack for more
> than 10 seconds.
> These threads are possibly (but not necessarily!) in a deadlock or hung.
>
> pool-1-thread-128 <--- Frozen for at least 2m
> org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139
> org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, Callback) KafkaProducer.java:237
> org.kafka.test.TestNetworkDownProducer$MyProducer.run() TestNetworkDownProducer.java:84
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1145
> java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:615
> java.lang.Thread.run() Thread.java:744
>
> pool-1-thread-159 <--- Frozen for at least 2m 1 sec
> org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139
> org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, Callback) KafkaProducer.java:237
> org.kafka.test.TestNetworkDownProducer$MyProducer.run() TestNetworkDownProducer.java:84
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1145
> java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:615
> java.lang.Thread.run() Thread.java:744
>
> pool-1-thread-55 <--- Frozen for at least 2m
> org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139
> org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, Callback) KafkaProducer.java:237
> org.kafka.test.TestNetworkDownProducer$MyProducer.run() TestNetworkDownProducer.java:84
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1145
> java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:615
> java.lang.Thread.run() Thread.java:744
>
> Thanks,
> Bhavesh

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)