[ https://issues.apache.org/jira/browse/KAFKA-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174178#comment-14174178 ]

Ewen Cheslack-Postava commented on KAFKA-1710:
----------------------------------------------

This looks like a red herring due to the structure of the test. The test code 
generates 200 threads which share 4 producers, and each thread round-robins 
through the producers, then sleeps for 10ms.
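
For reference, the shape of the test is roughly the following (a simplified 
sketch, not the attached TestNetworkDownProducer.java; the broker address, 
topic name, payload size, and configs are placeholders):

{code:java}
// Sketch: 200 worker threads share 4 producers, round-robin across them,
// send every record to the same partition, then sleep 10 ms.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SharedProducerLoadSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("compression.type", "gzip");               // compression happens while the batch queue is locked

        // 4 producers shared by all worker threads, as in the test.
        List<KafkaProducer<byte[], byte[]>> producers = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            producers.add(new KafkaProducer<>(props));
        }

        byte[] payload = new byte[512];                       // placeholder message size
        ExecutorService pool = Executors.newFixedThreadPool(200);
        for (int t = 0; t < 200; t++) {
            pool.submit(() -> {
                int next = 0;
                while (!Thread.currentThread().isInterrupted()) {
                    // Every record targets partition 0, so all appends contend
                    // on the same per-partition queue inside each producer.
                    producers.get(next++ % producers.size())
                             .send(new ProducerRecord<>("test-topic", 0, null, payload));
                    try {
                        Thread.sleep(10);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
    }
}
{code}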

It looks like all that's happening is that the profiling tool sees the same 
stack trace repeatedly because there's a huge amount of contention for the 4 
producers. If you take a look at the stack traces, they're almost all waiting 
on a lock on a queue that the messages get appended to. The few active threads 
have those queues locked and are working on compressing data before sending it 
out. Given the number of threads and the small number of producers, it's not 
surprising that YourKit sees the same stack traces for a long time -- the 
threads can be making forward progress, but any time the profiler stops to look 
at the stack traces, it's very likely that any given thread will be waiting on 
a lock with the same stack trace. None of the stack traces show any evidence of 
a real deadlock (i.e. I can't find any set of locks where there could be 
ordering issues, since almost every thread is just waiting on a single lock in 
one of the producers).
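
To make that pattern concrete, here is a stripped-down illustration of the 
kind of contention involved (just the shape of it, not the actual 
RecordAccumulator code): many appender threads block on a per-partition queue 
lock while whichever thread holds it is doing comparatively slow work, such as 
compressing a batch.

{code:java}
// Illustration only: threads pile up on the queue lock while the lock holder
// does slow work, so a sampling profiler keeps seeing the same stack traces.
import java.util.ArrayDeque;
import java.util.Deque;

public class QueueContentionSketch {
    private final Deque<byte[]> batchQueue = new ArrayDeque<>();

    public void append(byte[] record) {
        synchronized (batchQueue) {          // <-- where the "frozen" threads wait
            batchQueue.addLast(record);
            if (batchQueue.size() >= 1000) {
                compressAndDrain();          // slow work done while holding the lock
            }
        }
    }

    private void compressAndDrain() {
        // Placeholder for compression; in the real producer this is where the
        // few runnable threads in the profiler spend their time.
        batchQueue.clear();
    }
}
{code}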

If this did hit a deadlock, the process should stop entirely because all the 
worker threads use all 4 producers and the supposedly deadlocked threads are 
all waiting on locks inside the producers. I ran the test to completion multiple
times without any issues. Unless this has actually been observed to hit 
deadlock and stop making progress, I think this should be closed since these 
messages are really just warnings from YourKit.
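
If anyone wants a definitive check, the JVM can report true monitor deadlocks 
directly. Something like the following (standard JDK ThreadMXBean API, run 
from a separate monitoring thread in the test; the class and method names are 
just illustrative) would print the deadlocked threads if a real cycle existed, 
and report nothing otherwise:

{code:java}
// Ask the JVM itself whether a real deadlock exists, instead of relying on
// YourKit's "frozen for N seconds" heuristic.
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
    public static void report() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads();   // null if no real deadlock exists
        if (ids == null) {
            System.out.println("No JVM-level deadlock detected (threads are just contended).");
            return;
        }
        for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
            System.out.println("Deadlocked: " + info.getThreadName()
                    + " waiting on " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
        }
    }
}
{code}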

[~Bmis13] you might try reducing the # of threads and seeing if those charts 
end up looking better. I bet if you actually showed all the threads instead of 
just the couple in the screenshot, the areas marked as runnable across all 
threads would sum to a reasonable total. Also, there are other possible issues 
with getting good performance from this test code, e.g. the round robin 
approach can cause all threads to get blocked on the same producer if the 
producer gets locked for a relatively long time. This can happen when data is 
ready to be sent and is getting compressed. Other approaches to distributing 
work across the producers may provide better throughput.
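
For example, one simple alternative (a sketch under the same 200-thread / 
4-producer setup; the names are illustrative) is to pin each worker to a 
single producer, so a long lock hold inside one producer only stalls the 
workers pinned to it rather than all of them:

{code:java}
// Sketch: give each worker thread a fixed producer instead of round-robining.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.List;

public class PinnedProducerSketch {
    // 'producers' and 'payload' are assumed to be set up as in the test code.
    static void startWorkers(List<KafkaProducer<byte[], byte[]>> producers,
                             int workerCount, byte[] payload) {
        for (int t = 0; t < workerCount; t++) {
            // Each worker uses exactly one producer for its whole lifetime.
            KafkaProducer<byte[], byte[]> mine = producers.get(t % producers.size());
            new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    mine.send(new ProducerRecord<>("test-topic", 0, null, payload));
                    try {
                        Thread.sleep(10);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }, "pinned-worker-" + t).start();
        }
    }
}
{code}

Whether that actually improves throughput here would need measuring, since 
every record still lands in the same partition.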


> [New Java Producer Potential Deadlock] Producer Deadlock when all messages is 
> being sent to single partition
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1710
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1710
>             Project: Kafka
>          Issue Type: Bug
>          Components: producer 
>         Environment: Development
>            Reporter: Bhavesh Mistry
>            Priority: Critical
>              Labels: performance
>         Attachments: Screen Shot 2014-10-13 at 10.19.04 AM.png, Screen Shot 
> 2014-10-15 at 9.09.06 PM.png, Screen Shot 2014-10-15 at 9.14.15 PM.png, 
> TestNetworkDownProducer.java, th1.dump, th10.dump, th11.dump, th12.dump, 
> th13.dump, th14.dump, th15.dump, th2.dump, th3.dump, th4.dump, th5.dump, 
> th6.dump, th7.dump, th8.dump, th9.dump
>
>
> Hi Kafka Dev Team,
> When I run the test to send messages to a single partition for 3 minutes or 
> so, I encounter a deadlock (please see the attached screenshots) and thread 
> contention in YourKit profiling.  
> Use Case:
> 1)  Aggregating messages into the same partition for metric counting. 
> 2)  Replicating the old producer's behavior of sticking to a partition for 3 minutes.
> Here is the output:
> Frozen threads found (potential deadlock)
>  
> It seems that the following threads have not changed their stack for more 
> than 10 seconds.
> These threads are possibly (but not necessarily!) in a deadlock or hung.
>  
> pool-1-thread-128 <--- Frozen for at least 2m
> org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139
> org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, Callback) KafkaProducer.java:237
> org.kafka.test.TestNetworkDownProducer$MyProducer.run() TestNetworkDownProducer.java:84
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1145
> java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:615
> java.lang.Thread.run() Thread.java:744
>
> pool-1-thread-159 <--- Frozen for at least 2m 1 sec
> org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139
> org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, Callback) KafkaProducer.java:237
> org.kafka.test.TestNetworkDownProducer$MyProducer.run() TestNetworkDownProducer.java:84
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1145
> java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:615
> java.lang.Thread.run() Thread.java:744
>
> pool-1-thread-55 <--- Frozen for at least 2m
> org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) RecordAccumulator.java:139
> org.apache.kafka.clients.producer.KafkaProducer.send(ProducerRecord, Callback) KafkaProducer.java:237
> org.kafka.test.TestNetworkDownProducer$MyProducer.run() TestNetworkDownProducer.java:84
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1145
> java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:615
> java.lang.Thread.run() Thread.java:744
>
> Thanks,
> Bhavesh 


