Hey Amit! Thanks for posting this here. I don't think it's an issue with the buffer pool per se. Instead, I think there are two potential causes here:
1. The generated flow doesn't use blocking intermediate results for a branching-joining flow. => We can check this if you run `StreamExecutionEnvironment#getExecutionPlan()` and post the output here.
2. The blocking intermediate results are used, but there is an issue with their implementation. => Depending on the output of 1, we can investigate this option.

As Fabian mentioned, running this with a newer version of Flink would be very helpful. If the problem still persists, it will also make it more likely that the issue is fixed faster. ;-)

– Ufuk

On Fri, Apr 6, 2018 at 5:43 AM, Nico Kruber <n...@data-artisans.com> wrote:
> I'm not aware of any changes regarding the blocking buffer pools though.
>
> Is it really stuck or just making progress slowly? (You can check with
> the number of records sent/received in the Web UI.)
>
> Anyway, this may also simply mean that the task is back-pressured,
> depending on how the operators are wired together. In that case, all
> available buffers for that ResultPartition have been used (records were
> serialized into them) and are now waiting either on Netty to send the data
> or on a SpillingSubpartition to spill it to disk.
> Please also check for warnings or errors in the affected TaskManager's
> log files.
>
> If you can reproduce the problem, could you try reducing your program to
> a minimal working example and provide it here for further debugging?
>
>
> Thanks
> Nico
>
> On 04/04/18 23:00, Fabian Hueske wrote:
>> Hi Amit,
>>
>> The network stack has been redesigned for the upcoming Flink 1.5 release.
>> The issue might have been fixed by that.
>>
>> There's already a first release candidate for Flink 1.5.0 available [1].
>> It would be great if you had the chance to check whether the bug is still
>> present.
>>
>> Best,
>> Fabian
>>
>> [1] https://lists.apache.org/thread.html/a6b6fb1a42a975608fa8641c86df30b47f022985ade845f1f1ec542a@%3Cdev.flink.apache.org%3E
>>
>> 2018-04-04 20:23 GMT+02:00 Ted Yu <yuzhih...@gmail.com>:
>>
>>> I searched for 0x00000005e28fe218 in the two files you attached
>>> to FLINK-2685 but didn't find any hit.
>>>
>>> Was this the same instance as the attachment to FLINK-2685?
>>>
>>> Thanks
>>>
>>> On Wed, Apr 4, 2018 at 10:21 AM, Amit Jain <aj201...@gmail.com> wrote:
>>>
>>>> +u...@flink.apache.org
>>>>
>>>> On Wed, Apr 4, 2018 at 11:33 AM, Amit Jain <aj201...@gmail.com> wrote:
>>>>> Hi,
>>>>>
>>>>> We are hitting a TaskManager deadlock on the NetworkBufferPool in Flink 1.3.2.
>>>>> We have a set of ETL merge jobs for a number of tables, and they get
>>>>> stuck with the above issue randomly every day.
>>>>>
>>>>> I'm attaching the thread dumps of the JobManager and of one of the
>>>>> TaskManagers (T1) running a stuck job.
>>>>> We also observed that sometimes a new job scheduled on T1 progresses
>>>>> even while another job is stuck there.
>>>>>
>>>>> "CHAIN DataSource (at createInput(ExecutionEnvironment.java:553)
>>>>> (org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> Map (Map
>>>>> at main(MergeTableSecond.java:175)) -> Map (Key Extractor) (6/9)" #1501
>>>>> daemon prio=5 os_prio=0 tid=0x00007f9ea84d2fb0 nid=0x22fe in Object.wait()
>>>>> [0x00007f9ebf102000]
>>>>>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>>>>>   at java.lang.Object.wait(Native Method)
>>>>>   at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBuffer(LocalBufferPool.java:224)
>>>>>   - locked <0x00000005e28fe218> (a java.util.ArrayDeque)
>>>>>   at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBlocking(LocalBufferPool.java:193)
>>>>>   at org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:132)
>>>>>   - locked <0x00000005e29125f0> (a org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializer)
>>>>>   at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:89)
>>>>>   at org.apache.flink.runtime.operators.shipping.OutputCollector.collect(OutputCollector.java:65)
>>>>>   at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
>>>>>   at org.apache.flink.runtime.operators.chaining.ChainedMapDriver.collect(ChainedMapDriver.java:79)
>>>>>   at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
>>>>>   at org.apache.flink.runtime.operators.chaining.ChainedMapDriver.collect(ChainedMapDriver.java:79)
>>>>>   at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
>>>>>   at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:168)
>>>>>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
>>>>>   at java.lang.Thread.run(Thread.java:748)
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> Amit
>>>>
>>>
>>
>
> --
> Nico Kruber | Software Engineer
> data Artisans
>
> Follow us @dataArtisans
> --
> Join Flink Forward - The Apache Flink Conference
> Stream Processing | Event Driven | Real Time
> --
> Data Artisans GmbH | Stresemannstr. 121A, 10963 Berlin, Germany
> data Artisans, Inc. | 1161 Mission Street, San Francisco, CA 94103, USA
> --
> Data Artisans GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
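
[Editor's note] To make the back-pressure mechanics Nico describes concrete: the thread dump shows the producer inside `LocalBufferPool.requestBuffer`, in `Object.wait()` on a `java.util.ArrayDeque` of free buffers. Below is a minimal, simplified sketch of that pattern — my own illustrative code, not Flink's actual `LocalBufferPool` implementation (class name, field names, and sizes are all made up). It shows why a producer blocks when all buffers are handed out and nothing downstream recycles them.

```java
import java.util.ArrayDeque;

// Simplified sketch (NOT Flink's real code): a fixed-size pool of byte
// buffers. requestBufferBlocking() waits on the free-buffer queue, which
// mirrors the TIMED_WAITING frame on an ArrayDeque in the thread dump.
// If no consumer ever calls recycle() -- because the data is still waiting
// on Netty to be sent or on a SpillingSubpartition to spill -- the producer
// stays blocked here, which is exactly what back-pressure looks like.
public class MiniBufferPool {
    private final ArrayDeque<byte[]> freeBuffers = new ArrayDeque<>();

    public MiniBufferPool(int numBuffers, int bufferSize) {
        for (int i = 0; i < numBuffers; i++) {
            freeBuffers.add(new byte[bufferSize]);
        }
    }

    // Block until a free buffer is available (cf. requestBufferBlocking
    // in the dump). Uses a timed wait in a loop, re-checking the queue.
    public byte[] requestBufferBlocking() throws InterruptedException {
        synchronized (freeBuffers) {
            while (freeBuffers.isEmpty()) {
                freeBuffers.wait(2000L);
            }
            return freeBuffers.poll();
        }
    }

    // Called by the consumer side once the data has been shipped or
    // spilled; this is what unblocks a waiting producer.
    public void recycle(byte[] buffer) {
        synchronized (freeBuffers) {
            freeBuffers.add(buffer);
            freeBuffers.notifyAll();
        }
    }

    public static void main(String[] args) throws Exception {
        MiniBufferPool pool = new MiniBufferPool(2, 32768);
        byte[] b1 = pool.requestBufferBlocking(); // succeeds immediately
        byte[] b2 = pool.requestBufferBlocking(); // succeeds; pool now empty
        // A third request here would block in Object.wait() until recycle()
        // runs -- the same state as the stuck RecordWriter in the dump.
        pool.recycle(b1);
        byte[] b3 = pool.requestBufferBlocking(); // succeeds after recycle
        System.out.println("buffers obtained: 3");
    }
}
```

If the third request never returns, the fix is not in the pool itself but upstream of `recycle()`: either the consumer is slow (genuine back-pressure) or, as Ufuk suspects, the plan is missing a blocking intermediate result between the branching and joining parts of the flow — which is why posting the `getExecutionPlan()` output matters.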