Hi Amit,

The network stack has been redesigned for the upcoming Flink 1.5 release. The issue might have been fixed by that.
There's already a first release candidate for Flink 1.5.0 available [1]. It would be great if you could check whether the bug is still present.

Best, Fabian

[1] https://lists.apache.org/thread.html/a6b6fb1a42a975608fa8641c86df30b47f022985ade845f1f1ec542a@%3Cdev.flink.apache.org%3E

2018-04-04 20:23 GMT+02:00 Ted Yu <yuzhih...@gmail.com>:

> I searched for 0x00000005e28fe218 in the two files you attached
> to FLINK-2685 but didn't find any hit.
>
> Was this the same instance as the attachment to FLINK-2685?
>
> Thanks
>
> On Wed, Apr 4, 2018 at 10:21 AM, Amit Jain <aj201...@gmail.com> wrote:
>
> > +u...@flink.apache.org
> >
> > On Wed, Apr 4, 2018 at 11:33 AM, Amit Jain <aj201...@gmail.com> wrote:
> > > Hi,
> > >
> > > We are hitting the TaskManager deadlock on NetworkBufferPool bug in Flink 1.3.2.
> > > We have a set of ETL merge jobs for a number of tables, and they get stuck
> > > on the above issue randomly, every day.
> > >
> > > I'm attaching the thread dumps of the JobManager and of one of the TaskManagers (T1)
> > > running a stuck job.
> > > We also observed that sometimes a new job scheduled on T1 progresses even though
> > > another job is stuck there.
> > >
> > > "CHAIN DataSource (at createInput(ExecutionEnvironment.java:553) (org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> Map (Map at main(MergeTableSecond.java:175)) -> Map (Key Extractor) (6/9)" #1501 daemon prio=5 os_prio=0 tid=0x00007f9ea84d2fb0 nid=0x22fe in Object.wait() [0x00007f9ebf102000]
> > >    java.lang.Thread.State: TIMED_WAITING (on object monitor)
> > >     at java.lang.Object.wait(Native Method)
> > >     at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBuffer(LocalBufferPool.java:224)
> > >     - locked <0x00000005e28fe218> (a java.util.ArrayDeque)
> > >     at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBlocking(LocalBufferPool.java:193)
> > >     at org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:132)
> > >     - locked <0x00000005e29125f0> (a org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializer)
> > >     at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:89)
> > >     at org.apache.flink.runtime.operators.shipping.OutputCollector.collect(OutputCollector.java:65)
> > >     at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
> > >     at org.apache.flink.runtime.operators.chaining.ChainedMapDriver.collect(ChainedMapDriver.java:79)
> > >     at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
> > >     at org.apache.flink.runtime.operators.chaining.ChainedMapDriver.collect(ChainedMapDriver.java:79)
> > >     at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
> > >     at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:168)
> > >     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
> > >     at java.lang.Thread.run(Thread.java:748)
> > >
> > > --
> > > Thanks,
> > > Amit