Ufuk Celebi created FLINK-2091:
----------------------------------

             Summary: Lock contention during release of network buffer pools
                 Key: FLINK-2091
                 URL: https://issues.apache.org/jira/browse/FLINK-2091
             Project: Flink
          Issue Type: Improvement
          Components: Distributed Runtime
    Affects Versions: master
            Reporter: Ufuk Celebi
            Assignee: Ufuk Celebi

[~rmetzger] reported the following stack traces while cancelling high-parallelism jobs:

{code}
13:43:46,803 WARN  org.apache.flink.runtime.taskmanager.Task - Task 'DataSource (at main(Job.java:59) (org.apache.flink.api.java.io.TextInputFormat)) (4/16)' did not react to cancelling signal, but is stuck in method:
 org.apache.flink.runtime.io.network.buffer.LocalBufferPool.setNumBuffers(LocalBufferPool.java:238)
 org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.redistributeBuffers(NetworkBufferPool.java:268)
 org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.destroyBufferPool(NetworkBufferPool.java:218)
 org.apache.flink.runtime.io.network.buffer.LocalBufferPool.lazyDestroy(LocalBufferPool.java:221)
 org.apache.flink.runtime.io.network.partition.ResultPartition.destroyBufferPool(ResultPartition.java:302)
 org.apache.flink.runtime.io.network.NetworkEnvironment.unregisterTask(NetworkEnvironment.java:366)
 org.apache.flink.runtime.taskmanager.Task.run(Task.java:647)
 java.lang.Thread.run(Thread.java:745)
{code}

{code}
13:42:57,595 WARN  org.apache.flink.runtime.taskmanager.Task - Task 'DataSource (at main(Job.java:59) (org.apache.flink.api.java.io.TextInputFormat)) (16/16)' did not react to cancelling signal, but is stuck in method:
 org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.destroyBufferPool(NetworkBufferPool.java:212)
 org.apache.flink.runtime.io.network.buffer.LocalBufferPool.lazyDestroy(LocalBufferPool.java:221)
 org.apache.flink.runtime.io.network.partition.ResultPartition.destroyBufferPool(ResultPartition.java:302)
 org.apache.flink.runtime.io.network.NetworkEnvironment.unregisterTask(NetworkEnvironment.java:366)
 org.apache.flink.runtime.taskmanager.Task.run(Task.java:647)
 java.lang.Thread.run(Thread.java:745)
{code}

The issue is that, when high-parallelism jobs are cancelled, the locks for buffer pool management are highly contended: the traces show tasks stuck in NetworkBufferPool.destroyBufferPool and in LocalBufferPool.setNumBuffers (reached via redistributeBuffers).
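
To illustrate why the release path can become slow at high parallelism, here is a minimal sketch. It is not Flink's actual code. It assumes, as the call chain in the stack traces suggests, that destroying one pool takes a global lock and then resizes every remaining pool while holding it. All names (BufferPoolContentionSketch, GlobalPool, LocalPool, releaseConcurrently) and the buffer count 2048 are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class BufferPoolContentionSketch {

    /** Hypothetical stand-in for a per-task pool; not Flink's LocalBufferPool. */
    static final class LocalPool {
        private final Object lock = new Object();
        private int numBuffers;

        void setNumBuffers(int n, AtomicLong resizeCalls) {
            synchronized (lock) { // per-pool lock, taken while the global lock is held
                numBuffers = n;
                resizeCalls.incrementAndGet();
            }
        }
    }

    /** Hypothetical stand-in for the shared pool; not Flink's NetworkBufferPool. */
    static final class GlobalPool {
        private final Object lock = new Object();
        private final List<LocalPool> pools = new ArrayList<>();
        private final int totalBuffers;
        final AtomicLong resizeCalls = new AtomicLong();

        GlobalPool(int totalBuffers) {
            this.totalBuffers = totalBuffers;
        }

        LocalPool createPool() {
            synchronized (lock) {
                LocalPool pool = new LocalPool();
                pools.add(pool);
                return pool;
            }
        }

        /** Every destroy serialises on the global lock and touches all remaining pools. */
        void destroyPool(LocalPool pool) {
            synchronized (lock) {
                pools.remove(pool);
                redistribute();
            }
        }

        // Called only while holding the global lock.
        private void redistribute() {
            if (pools.isEmpty()) {
                return;
            }
            int share = totalBuffers / pools.size();
            for (LocalPool p : pools) {
                p.setNumBuffers(share, resizeCalls);
            }
        }
    }

    /** Releases numTasks pools from numTasks threads; returns the number of resize calls. */
    static long releaseConcurrently(int numTasks) {
        GlobalPool global = new GlobalPool(2048); // assumed total buffer count
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            LocalPool pool = global.createPool();
            threads.add(new Thread(() -> global.destroyPool(pool)));
        }
        for (Thread t : threads) {
            t.start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for release", e);
        }
        return global.resizeCalls.get();
    }

    public static void main(String[] args) {
        System.out.println(releaseConcurrently(16)); // prints 120
    }
}
```

In this model, the 16 destroy calls run one at a time under the global lock and perform 15 + 14 + ... + 0 = 120 resize calls in total. The work done under the lock therefore grows quadratically with parallelism, which would be consistent with tasks appearing stuck in these methods during cancellation.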