Stephan Ewen created FLINK-2341:
-----------------------------------

             Summary: Deadlock in SpilledSubpartitionViewAsyncIO
                 Key: FLINK-2341
                 URL: https://issues.apache.org/jira/browse/FLINK-2341
             Project: Flink
          Issue Type: Bug
          Components: Distributed Runtime
    Affects Versions: 0.9, 0.10
            Reporter: Stephan Ewen
            Assignee: Ufuk Celebi
            Priority: Critical
             Fix For: 0.9, 0.10

It may be that the deadlock is because of the way the {{SpilledSubpartitionViewTest}} is written.

{code}
Found one Java-level deadlock:
=============================
"pool-25-thread-2":
  waiting to lock monitor 0x00007f66f4932468 (object 0x00000000fa1478f0, a java.lang.Object),
  which is held by "IOManager reader thread #1"
"IOManager reader thread #1":
  waiting to lock monitor 0x00007f66f4931160 (object 0x00000000fa029768, a java.lang.Object),
  which is held by "pool-25-thread-2"

Java stack information for the threads listed above:
===================================================
"pool-25-thread-2":
	at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO.notifyError(SpilledSubpartitionViewAsyncIO.java:304)
	- waiting to lock <0x00000000fa1478f0> (a java.lang.Object)
	at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO.onAvailableBuffer(SpilledSubpartitionViewAsyncIO.java:256)
	at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO.access$300(SpilledSubpartitionViewAsyncIO.java:42)
	at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO$BufferProviderCallback.onEvent(SpilledSubpartitionViewAsyncIO.java:367)
	at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO$BufferProviderCallback.onEvent(SpilledSubpartitionViewAsyncIO.java:353)
	at org.apache.flink.runtime.io.network.util.TestPooledBufferProvider$PooledBufferProviderRecycler.recycle(TestPooledBufferProvider.java:135)
	- locked <0x00000000fa029768> (a java.lang.Object)
	at org.apache.flink.runtime.io.network.buffer.Buffer.recycle(Buffer.java:119)
	- locked <0x00000000fa3a1a20> (a java.lang.Object)
	at org.apache.flink.runtime.io.network.util.TestSubpartitionConsumer.call(TestSubpartitionConsumer.java:95)
	at org.apache.flink.runtime.io.network.util.TestSubpartitionConsumer.call(TestSubpartitionConsumer.java:39)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:701)
"IOManager reader thread #1":
	at org.apache.flink.runtime.io.network.util.TestPooledBufferProvider$PooledBufferProviderRecycler.recycle(TestPooledBufferProvider.java:127)
	- waiting to lock <0x00000000fa029768> (a java.lang.Object)
	at org.apache.flink.runtime.io.network.buffer.Buffer.recycle(Buffer.java:119)
	- locked <0x00000000fa3a1ea0> (a java.lang.Object)
	at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO.returnBufferFromIOThread(SpilledSubpartitionViewAsyncIO.java:270)
	- locked <0x00000000fa1478f0> (a java.lang.Object)
	at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO.access$100(SpilledSubpartitionViewAsyncIO.java:42)
	at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO$IOThreadCallback.requestSuccessful(SpilledSubpartitionViewAsyncIO.java:338)
	at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO$IOThreadCallback.requestSuccessful(SpilledSubpartitionViewAsyncIO.java:328)
	at org.apache.flink.runtime.io.disk.iomanager.AsynchronousFileIOChannel.handleProcessedBuffer(AsynchronousFileIOChannel.java:199)
	at org.apache.flink.runtime.io.disk.iomanager.BufferReadRequest.requestDone(AsynchronousFileIOChannel.java:431)
	at org.apache.flink.runtime.io.disk.iomanager.IOManagerAsync$ReaderThread.run(IOManagerAsync.java:377)
{code}

The full log with the deadlock stack traces can be found here:
https://s3.amazonaws.com/archive.travis-ci.org/jobs/70232347/log.txt
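
For readers less familiar with this kind of dump: the two threads above take the same pair of monitors in opposite order. "pool-25-thread-2" holds the monitor locked in {{PooledBufferProviderRecycler.recycle}} and waits for the one needed by {{notifyError}}, while "IOManager reader thread #1" holds the monitor locked in {{returnBufferFromIOThread}} and waits for the one in {{recycle}}. The following is a minimal standalone sketch of that pattern, not Flink code. The class name, thread names, and the {{viewLock}} / {{providerLock}} variables are hypothetical stand-ins for the two monitors in the dump. It uses the JDK's {{ThreadMXBean.findMonitorDeadlockedThreads()}} to confirm that the JVM sees the cycle.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class LockOrderInversionDemo {

    /**
     * Starts two daemon threads that acquire two monitors in opposite order
     * and returns how many threads the JVM reports as monitor-deadlocked
     * (0 if no deadlock is reported within about five seconds).
     */
    public static int countDeadlockedThreads() {
        // Hypothetical stand-ins for the two java.lang.Object monitors in the dump.
        final Object viewLock = new Object();
        final Object providerLock = new Object();
        // Ensures each thread already holds its first lock before trying the second.
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        // Plays the role of the consumer thread: provider lock first, then view lock.
        Thread consumer = new Thread(() -> {
            synchronized (providerLock) {
                rendezvous(bothHoldFirstLock);
                synchronized (viewLock) { /* not reached once deadlocked */ }
            }
        }, "demo-consumer");

        // Plays the role of the I/O thread: view lock first, then provider lock.
        Thread ioThread = new Thread(() -> {
            synchronized (viewLock) {
                rendezvous(bothHoldFirstLock);
                synchronized (providerLock) { /* not reached once deadlocked */ }
            }
        }, "demo-io-thread");

        // Daemon threads so the stuck threads do not keep the JVM alive.
        consumer.setDaemon(true);
        ioThread.setDaemon(true);
        consumer.start();
        ioThread.start();

        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        long deadline = System.nanoTime() + 5_000_000_000L;
        while (System.nanoTime() < deadline) {
            long[] ids = threadBean.findMonitorDeadlockedThreads();
            if (ids != null) {
                return ids.length;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return 0;
            }
        }
        return 0;
    }

    // Signals arrival and waits until the other thread has arrived as well.
    private static void rendezvous(CountDownLatch latch) {
        latch.countDown();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + countDeadlockedThreads());
    }
}
```

A common way to avoid this kind of cycle is to acquire the monitors in one fixed order, or to avoid invoking callbacks (such as a buffer-available listener) while holding a monitor. Whether either applies here depends on how the test utilities and {{SpilledSubpartitionViewAsyncIO}} are meant to interact, which this sketch does not attempt to settle.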