Ufuk Celebi created FLINK-2091:
----------------------------------

             Summary: Lock contention during release of network buffer pools
                 Key: FLINK-2091
                 URL: https://issues.apache.org/jira/browse/FLINK-2091
             Project: Flink
          Issue Type: Improvement
          Components: Distributed Runtime
    Affects Versions: master
            Reporter: Ufuk Celebi
            Assignee: Ufuk Celebi


[~rmetzger] reported the following stack traces while cancelling 
high-parallelism jobs:

{code}
13:43:46,803 WARN  org.apache.flink.runtime.taskmanager.Task - Task 'DataSource (at main(Job.java:59) (org.apache.flink.api.java.io.TextInputFormat)) (4/16)' did not react to cancelling signal, but is stuck in method:
 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.setNumBuffers(LocalBufferPool.java:238)
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.redistributeBuffers(NetworkBufferPool.java:268)
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.destroyBufferPool(NetworkBufferPool.java:218)
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.lazyDestroy(LocalBufferPool.java:221)
org.apache.flink.runtime.io.network.partition.ResultPartition.destroyBufferPool(ResultPartition.java:302)
org.apache.flink.runtime.io.network.NetworkEnvironment.unregisterTask(NetworkEnvironment.java:366)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:647)
java.lang.Thread.run(Thread.java:745)
{code}

{code}
13:42:57,595 WARN  org.apache.flink.runtime.taskmanager.Task - Task 'DataSource (at main(Job.java:59) (org.apache.flink.api.java.io.TextInputFormat)) (16/16)' did not react to cancelling signal, but is stuck in method:
 
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.destroyBufferPool(NetworkBufferPool.java:212)
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.lazyDestroy(LocalBufferPool.java:221)
org.apache.flink.runtime.io.network.partition.ResultPartition.destroyBufferPool(ResultPartition.java:302)
org.apache.flink.runtime.io.network.NetworkEnvironment.unregisterTask(NetworkEnvironment.java:366)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:647)
java.lang.Thread.run(Thread.java:745)
{code}

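Both tasks are stuck in NetworkBufferPool.destroyBufferPool, the first one 
while redistributing buffers via LocalBufferPool.setNumBuffers. The issue is 
that the locks for buffer pool management are highly contended when 
high-parallelism jobs are cancelled.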
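
To illustrate the suspected pattern, here is a minimal sketch. It is not the actual Flink code: the classes GlobalPool and LocalPool, the factoryLock field and the resizeCalls counter are hypothetical stand-ins, and the only thing taken from the traces is the call chain destroyBufferPool -> redistributeBuffers -> setNumBuffers. It assumes that destroying a pool takes one shared lock and then resizes every remaining pool under that lock.

```java
import java.util.ArrayList;
import java.util.List;

public class BufferPoolContentionSketch {

    /** Hypothetical stand-in for a per-task buffer pool with its own lock. */
    static final class LocalPool {
        private final Object lock = new Object();
        private int numBuffers;

        void setNumBuffers(int n) {
            synchronized (lock) {
                numBuffers = n;
            }
        }
    }

    /** Hypothetical stand-in for the shared pool that owns all local pools. */
    static final class GlobalPool {
        private final Object factoryLock = new Object();
        private final List<LocalPool> pools = new ArrayList<>();
        private final int totalBuffers;
        private long resizeCalls; // guarded by factoryLock

        GlobalPool(int totalBuffers) {
            this.totalBuffers = totalBuffers;
        }

        LocalPool createBufferPool() {
            synchronized (factoryLock) {
                LocalPool pool = new LocalPool();
                pools.add(pool);
                redistributeBuffers();
                return pool;
            }
        }

        void destroyBufferPool(LocalPool pool) {
            // Every cancelling task thread queues up on this single lock.
            synchronized (factoryLock) {
                if (pools.remove(pool)) {
                    redistributeBuffers();
                }
            }
        }

        // Called with factoryLock held: touches every remaining pool.
        private void redistributeBuffers() {
            if (pools.isEmpty()) {
                return;
            }
            int share = totalBuffers / pools.size();
            for (LocalPool pool : pools) {
                pool.setNumBuffers(share);
                resizeCalls++;
            }
        }

        long getResizeCalls() {
            synchronized (factoryLock) {
                return resizeCalls;
            }
        }
    }

    /** Destroys all pools concurrently and counts resizes done during teardown. */
    static long resizeCallsForTeardown(int parallelism) {
        GlobalPool global = new GlobalPool(parallelism * 8);
        List<LocalPool> locals = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            locals.add(global.createBufferPool());
        }
        long before = global.getResizeCalls();

        List<Thread> threads = new ArrayList<>();
        for (LocalPool local : locals) {
            Thread t = new Thread(() -> global.destroyBufferPool(local));
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
        }
        return global.getResizeCalls() - before;
    }

    public static void main(String[] args) {
        // 15 + 14 + ... + 0 resizes, all serialized behind factoryLock.
        System.out.println(resizeCallsForTeardown(16)); // prints 120
    }
}
```

Under these assumptions, tearing down n pools performs n(n-1)/2 resize calls in total, all serialized behind one lock, which would explain why tasks pile up in destroyBufferPool at high parallelism.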



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
