Hey Amit! Thanks for posting this here. I don't think it's an issue with the buffer pool per se. Instead, I think there are two potential causes here:
1. The generated flow doesn't use blocking intermediate results for a branching-joining flow. => We can check this if you run `StreamExecutionEnvironment#getExecutionPlan()` and post the output here.
2. The blocking intermediate results are used, but there is an issue with their implementation. => Depending on the output of 1, we can investigate this option.

As Fabian mentioned, running this with a newer version of Flink would be very helpful. If the problem still persists, it will also make it more likely that the issue is fixed faster. ;-)

– Ufuk

On Fri, Apr 6, 2018 at 5:43 AM, Nico Kruber <n...@data-artisans.com> wrote:
> I'm not aware of any changes regarding the blocking buffer pools though.
>
> Is it really stuck or just making progress slowly? (You can check with
> the number of records sent/received in the Web UI.)
>
> Anyway, this may also simply mean that the task is back-pressured,
> depending on how the operators are wired together. In that case, all
> available buffers for that ResultPartition have been used (records were
> serialized into them) and are now waiting either on Netty to send the data
> or on a SpillingSubpartition to spill it to disk.
> Please also check for warnings or errors in the affected TaskManager's
> log files.
>
> If you can reproduce the problem, could you try reducing your program to
> a minimal working example and provide it here for further debugging?
>
>
> Thanks
> Nico
>
> On 04/04/18 23:00, Fabian Hueske wrote:
>> Hi Amit,
>>
>> The network stack has been redesigned for the upcoming Flink 1.5 release.
>> The issue might have been fixed by that.
>>
>> There's already a first release candidate for Flink 1.5.0 available [1].
>> It would be great if you had the chance to check whether the bug is still
>> present.
>>
>> Best,
>> Fabian
>>
>> [1] https://lists.apache.org/thread.html/a6b6fb1a42a975608fa8641c86df30b47f022985ade845f1f1ec542a@%3Cdev.flink.apache.org%3E
>>
>> 2018-04-04 20:23 GMT+02:00 Ted Yu <yuzhih...@gmail.com>:
>>
>>> I searched for 0x00000005e28fe218 in the two files you attached
>>> to FLINK-2685 but didn't find any hit.
>>>
>>> Was this the same instance as the attachment to FLINK-2685?
>>>
>>> Thanks
>>>
>>> On Wed, Apr 4, 2018 at 10:21 AM, Amit Jain <aj201...@gmail.com> wrote:
>>>
>>>> +u...@flink.apache.org
>>>>
>>>> On Wed, Apr 4, 2018 at 11:33 AM, Amit Jain <aj201...@gmail.com> wrote:
>>>>> Hi,
>>>>>
>>>>> We are hitting a TaskManager deadlock on the NetworkBufferPool in Flink 1.3.2.
>>>>> We have a set of ETL merge jobs for a number of tables, and they get
>>>>> stuck with the above issue randomly every day.
>>>>>
>>>>> I'm attaching the thread dumps of the JobManager and of one of the
>>>>> TaskManagers (T1) running a stuck job.
>>>>> We also observed that sometimes a new job scheduled on T1 progresses
>>>>> even while another job is stuck there.
>>>>>
>>>>> "CHAIN DataSource (at createInput(ExecutionEnvironment.java:553)
>>>>> (org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> Map (Map
>>>>> at main(MergeTableSecond.java:175)) -> Map (Key Extractor) (6/9)" #1501
>>>>> daemon prio=5 os_prio=0 tid=0x00007f9ea84d2fb0 nid=0x22fe in Object.wait()
>>>>> [0x00007f9ebf102000]
>>>>>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>>>>>   at java.lang.Object.wait(Native Method)
>>>>>   at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBuffer(LocalBufferPool.java:224)
>>>>>   - locked <0x00000005e28fe218> (a java.util.ArrayDeque)
>>>>>   at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBlocking(LocalBufferPool.java:193)
>>>>>   at org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:132)
>>>>>   - locked <0x00000005e29125f0> (a org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializer)
>>>>>   at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:89)
>>>>>   at org.apache.flink.runtime.operators.shipping.OutputCollector.collect(OutputCollector.java:65)
>>>>>   at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
>>>>>   at org.apache.flink.runtime.operators.chaining.ChainedMapDriver.collect(ChainedMapDriver.java:79)
>>>>>   at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
>>>>>   at org.apache.flink.runtime.operators.chaining.ChainedMapDriver.collect(ChainedMapDriver.java:79)
>>>>>   at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
>>>>>   at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:168)
>>>>>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
>>>>>   at java.lang.Thread.run(Thread.java:748)
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> Amit
>>>>
>>>
>>
>
> --
> Nico Kruber | Software Engineer
> data Artisans
>
> Follow us @dataArtisans
> --
> Join Flink Forward - The Apache Flink Conference
> Stream Processing | Event Driven | Real Time
> --
> Data Artisans GmbH | Stresemannstr. 121A, 10963 Berlin, Germany
> data Artisans, Inc. | 1161 Mission Street, San Francisco, CA 94103, USA
> --
> Data Artisans GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
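
[Editor's note] To make the back-pressure mechanics Nico describes concrete: the thread dump shows the producer inside `LocalBufferPool.requestBuffer`, in `Object.wait()` on a `java.util.ArrayDeque` of free buffers. Below is a minimal, simplified sketch of that pattern — my own illustrative code, not Flink's actual `LocalBufferPool` implementation (class name, field names, and sizes are all made up). It shows why a producer blocks when all buffers are handed out and nothing downstream recycles them.

```java
import java.util.ArrayDeque;

// Simplified sketch (NOT Flink's real code): a fixed-size pool of byte
// buffers. requestBufferBlocking() waits on the free-buffer queue, which
// mirrors the TIMED_WAITING frame on an ArrayDeque in the thread dump.
// If no consumer ever calls recycle() -- because the data is still waiting
// on Netty to be sent or on a SpillingSubpartition to spill -- the producer
// stays blocked here, which is exactly what back-pressure looks like.
public class MiniBufferPool {
    private final ArrayDeque<byte[]> freeBuffers = new ArrayDeque<>();

    public MiniBufferPool(int numBuffers, int bufferSize) {
        for (int i = 0; i < numBuffers; i++) {
            freeBuffers.add(new byte[bufferSize]);
        }
    }

    // Block until a free buffer is available (cf. requestBufferBlocking
    // in the dump). Uses a timed wait in a loop, re-checking the queue.
    public byte[] requestBufferBlocking() throws InterruptedException {
        synchronized (freeBuffers) {
            while (freeBuffers.isEmpty()) {
                freeBuffers.wait(2000L);
            }
            return freeBuffers.poll();
        }
    }

    // Called by the consumer side once the data has been shipped or
    // spilled; this is what unblocks a waiting producer.
    public void recycle(byte[] buffer) {
        synchronized (freeBuffers) {
            freeBuffers.add(buffer);
            freeBuffers.notifyAll();
        }
    }

    public static void main(String[] args) throws Exception {
        MiniBufferPool pool = new MiniBufferPool(2, 32768);
        byte[] b1 = pool.requestBufferBlocking(); // succeeds immediately
        byte[] b2 = pool.requestBufferBlocking(); // succeeds; pool now empty
        // A third request here would block in Object.wait() until recycle()
        // runs -- the same state as the stuck RecordWriter in the dump.
        pool.recycle(b1);
        byte[] b3 = pool.requestBufferBlocking(); // succeeds after recycle
        System.out.println("buffers obtained: 3");
    }
}
```

If the third request never returns, the fix is not in the pool itself but upstream of `recycle()`: either the consumer is slow (genuine back-pressure) or, as Ufuk suspects, the plan is missing a blocking intermediate result between the branching and joining parts of the flow — which is why posting the `getExecutionPlan()` output matters.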