Amit: The execution plan attachment didn't come through. Please consider uploading the plan to a third-party website and sharing the link.
FYI

On Thu, Apr 19, 2018 at 10:04 AM, Amit Jain <aj201...@gmail.com> wrote:
> @Ufuk Please find the execution plan in the attachment.
>
> @Nico The job is not making progress at all. This issue happens
> randomly. A few of our jobs work with only a few MB of data and still
> get stuck, even though each TM has 22 GB with 2 slots per TM.
>
> I've started using 1.5 and am facing a few issues, which I'm
> communicating with the community these days. However, this issue seems
> to be solved there :-) Do you guys have a timeline for the 1.5 release?
>
> --
> Thanks,
> Amit
>
> On Fri, Apr 6, 2018 at 10:40 PM, Ufuk Celebi <u...@apache.org> wrote:
>> Hey Amit!
>>
>> Thanks for posting this here. I don't think it's an issue of the
>> buffer pool per se. Instead, I think there are two potential causes:
>>
>> 1. The generated flow doesn't use blocking intermediate results for a
>> branching-joining flow.
>> => I think we can check this if you run
>> `StreamExecutionEnvironment#getExecutionPlan()`. Can you please post
>> the result of this here?
>>
>> 2. The blocking intermediate result is used, but there is an issue
>> with its implementation.
>> => Depending on the output of 1, we can investigate this option.
>>
>> As Fabian mentioned, running this with a newer version of Flink will
>> be very helpful. If the problem still persists, it will also make it
>> more likely that the issue will be fixed faster. ;-)
>>
>> – Ufuk
>>
>> On Fri, Apr 6, 2018 at 5:43 AM, Nico Kruber <n...@data-artisans.com> wrote:
>>> I'm not aware of any changes regarding the blocking buffer pools, though.
>>>
>>> Is it really stuck, or just making progress slowly? (You can check
>>> the number of records sent/received in the Web UI.)
>>>
>>> Anyway, this may also simply mean that the task is back-pressured,
>>> depending on how the operators are wired together.
>>> In that case, all available buffers for that ResultPartition have
>>> been used (records were serialized into them) and are now waiting on
>>> Netty to send or on a SpillingSubpartition to spill data to disk.
>>> Please also check for warnings or errors in the affected
>>> TaskManager's log files.
>>>
>>> If you can reproduce the problem, could you try reducing your program
>>> to a minimal working example and provide it here for further debugging?
>>>
>>> Thanks,
>>> Nico
>>>
>>> On 04/04/18 23:00, Fabian Hueske wrote:
>>>> Hi Amit,
>>>>
>>>> The network stack has been redesigned for the upcoming Flink 1.5
>>>> release. The issue might have been fixed by that.
>>>>
>>>> There's already a first release candidate for Flink 1.5.0 available [1].
>>>> It would be great if you had the chance to check whether the bug is
>>>> still present.
>>>>
>>>> Best, Fabian
>>>>
>>>> [1] https://lists.apache.org/thread.html/a6b6fb1a42a975608fa8641c86df30b47f022985ade845f1f1ec542a@%3Cdev.flink.apache.org%3E
>>>>
>>>> 2018-04-04 20:23 GMT+02:00 Ted Yu <yuzhih...@gmail.com>:
>>>>> I searched for 0x00000005e28fe218 in the two files you attached
>>>>> to FLINK-2685 but didn't find any hit.
>>>>>
>>>>> Was this the same instance as the attachment to FLINK-2685?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Wed, Apr 4, 2018 at 10:21 AM, Amit Jain <aj201...@gmail.com> wrote:
>>>>>> +u...@flink.apache.org
>>>>>>
>>>>>> On Wed, Apr 4, 2018 at 11:33 AM, Amit Jain <aj201...@gmail.com> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> We are hitting a TaskManager deadlock on a NetworkBufferPool bug
>>>>>>> in Flink 1.3.2. We have a set of ETL merge jobs for a number of
>>>>>>> tables and hit the above issue randomly every day.
>>>>>>>
>>>>>>> I'm attaching the thread dumps of the JobManager and of one of
>>>>>>> the TaskManagers (T1) running the stuck job.
>>>>>>> We also observed that sometimes a new job scheduled on T1
>>>>>>> progresses even while another job is stuck there.
>>>>>>>
>>>>>>> "CHAIN DataSource (at createInput(ExecutionEnvironment.java:553) (org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> Map (Map at main(MergeTableSecond.java:175)) -> Map (Key Extractor) (6/9)" #1501 daemon prio=5 os_prio=0 tid=0x00007f9ea84d2fb0 nid=0x22fe in Object.wait() [0x00007f9ebf102000]
>>>>>>>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>>>>>>>         at java.lang.Object.wait(Native Method)
>>>>>>>         at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBuffer(LocalBufferPool.java:224)
>>>>>>>         - locked <0x00000005e28fe218> (a java.util.ArrayDeque)
>>>>>>>         at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBlocking(LocalBufferPool.java:193)
>>>>>>>         at org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:132)
>>>>>>>         - locked <0x00000005e29125f0> (a org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializer)
>>>>>>>         at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:89)
>>>>>>>         at org.apache.flink.runtime.operators.shipping.OutputCollector.collect(OutputCollector.java:65)
>>>>>>>         at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
>>>>>>>         at org.apache.flink.runtime.operators.chaining.ChainedMapDriver.collect(ChainedMapDriver.java:79)
>>>>>>>         at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
>>>>>>>         at org.apache.flink.runtime.operators.chaining.ChainedMapDriver.collect(ChainedMapDriver.java:79)
>>>>>>>         at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
>>>>>>>         at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:168)
>>>>>>>         at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
>>>>>>>         at java.lang.Thread.run(Thread.java:748)
>>>>>>>
>>>>>>> --
>>>>>>> Thanks,
>>>>>>> Amit
>>>
>>> --
>>> Nico Kruber | Software Engineer
>>> data Artisans
>>>
>>> Follow us @dataArtisans
>>> --
>>> Join Flink Forward - The Apache Flink Conference
>>> Stream Processing | Event Driven | Real Time
>>> --
>>> Data Artisans GmbH | Stresemannstr. 121A, 10963 Berlin, Germany
>>> data Artisans, Inc. | 1161 Mission Street, San Francisco, CA-94103, USA
>>> --
>>> Data Artisans GmbH
>>> Registered at Amtsgericht Charlottenburg: HRB 158244 B
>>> Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
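[Editor's note] Ted's check above (searching the thread dumps for the monitor address 0x00000005e28fe218) can be automated. The sketch below is a minimal, hypothetical helper, not part of the original thread: it scans jstack-style output for a given monitor address and lists the threads whose stack sections mention it. The class name `LockScanner` and the embedded two-thread dump fragment are made up for illustration; real dumps would be read from a file.

```java
import java.util.ArrayList;
import java.util.List;

public class LockScanner {

    /** Returns the names of threads whose dump section mentions the given monitor address. */
    static List<String> threadsMentioning(String dump, String address) {
        List<String> hits = new ArrayList<>();
        String currentThread = null;   // name of the thread section we are inside
        boolean found = false;         // did this section mention the address?
        for (String line : dump.split("\n")) {
            if (line.startsWith("\"")) {
                // jstack thread headers look like: "Thread Name" #1501 daemon prio=5 ...
                if (found && currentThread != null) {
                    hits.add(currentThread);
                }
                currentThread = line.substring(1, line.indexOf('"', 1));
                found = false;
            } else if (line.contains(address)) {
                found = true;
            }
        }
        if (found && currentThread != null) {
            hits.add(currentThread);   // flush the last section
        }
        return hits;
    }

    public static void main(String[] args) {
        // Fabricated dump fragment for illustration only.
        String dump =
            "\"Map (6/9)\" #1501 daemon prio=5\n"
            + "   java.lang.Thread.State: TIMED_WAITING (on object monitor)\n"
            + "        at java.lang.Object.wait(Native Method)\n"
            + "        - locked <0x00000005e28fe218> (a java.util.ArrayDeque)\n"
            + "\"Netty Server Thread\" #88 daemon prio=5\n"
            + "   java.lang.Thread.State: RUNNABLE\n";

        System.out.println(threadsMentioning(dump, "0x00000005e28fe218"));
        // prints: [Map (6/9)]
    }
}
```

Cross-referencing the reported thread names against the `- locked` and `waiting on` lines in each TaskManager dump is usually enough to tell a genuine deadlock (two threads each holding what the other waits for) from plain back-pressure, where only one side blocks in `LocalBufferPool.requestBuffer`.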