Re: TaskManager deadlock on NetworkBufferPool

2018-04-19, Amit Jain
Please use this link for the execution plan: https://gist.github.com/imamitjain/5ab84c2d9eaf06615ad912506a08f7e2

Re: TaskManager deadlock on NetworkBufferPool

2018-04-19, Ted Yu
Amit: The execution plan attachment didn't come through. Please consider using a third-party website to host the plan. FYI

Re: TaskManager deadlock on NetworkBufferPool

2018-04-19, Amit Jain
@Ufuk Please find the execution plan attached. @Nico The job is not making progress at all, and the issue happens randomly. A few of our jobs work with only a few MB of data and still get stuck, even though each TM has 22 GB and 2 slots. I've started using 1.5 and am facing a few issues

Re: TaskManager deadlock on NetworkBufferPool

2018-04-06, Ufuk Celebi
Hey Amit! Thanks for posting this here. I don't think it's an issue with the buffer pool per se. Instead, I think there are two potential causes: 1. The generated flow doesn't use blocking intermediate results for a branching-joining flow. => I think we can check this if you run and post the outp
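To make Ufuk's first point concrete, here is a toy model of a branching-joining flow. It is a sketch under stated assumptions, not Flink's scheduler or NetworkBufferPool code: one producer writes every record into two bounded pipelined channels, and a join-like consumer reads its build side to completion before it reads its probe side. The class `BranchJoinSim`, the `channelCapacity` parameter (standing in for a fixed number of network buffers) and the build-side-first read order are all hypothetical. Once the unread probe channel fills up, the producer blocks while the consumer is still waiting for the end of the build side, so neither can proceed; treating a blocking intermediate result as an unbounded, fully materialized channel removes that cycle.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical toy model (not Flink code) of a branching-joining flow: one producer
 * feeds the same records into two channels, and a join-like consumer drains the
 * build side completely before it starts reading the probe side.
 */
final class BranchJoinSim {

    /** Returns true if all probe-side records are consumed, false if the model deadlocks. */
    static boolean runsToCompletion(int records, int channelCapacity) {
        Deque<Integer> buildSide = new ArrayDeque<>();
        Deque<Integer> probeSide = new ArrayDeque<>();
        int produced = 0;
        int consumed = 0;
        boolean buildDone = false;

        while (consumed < records) {
            boolean progress = false;

            // Producer: emits the next record to BOTH branches, but only if both have room
            // (a bounded pipelined channel blocks the writer while it is full).
            if (produced < records
                    && buildSide.size() < channelCapacity
                    && probeSide.size() < channelCapacity) {
                buildSide.add(produced);
                probeSide.add(produced);
                produced++;
                progress = true;
            }

            // Consumer: drains the build side until the producer has finished it ...
            if (!buildDone) {
                if (!buildSide.isEmpty()) {
                    buildSide.poll();
                    progress = true;
                } else if (produced == records) {
                    buildDone = true;
                    progress = true;
                }
            } else if (!probeSide.isEmpty()) {
                // ... and only then reads the probe side.
                probeSide.poll();
                consumed++;
                progress = true;
            }

            // Producer blocked on the full probe channel, consumer waiting on the build side.
            if (!progress) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Pipelined exchange with small bounded channels.
        System.out.println("pipelined, 100 records, capacity 8: " + runsToCompletion(100, 8));
        // Blocking exchange modelled as an unbounded, fully materialized channel.
        System.out.println("blocking, 100 records: " + runsToCompletion(100, Integer.MAX_VALUE));
    }
}
```

In this model the hang appears only once the input exceeds the channel capacity; real jobs differ in buffer sizes and operator wiring, so the model does not predict at what data volume a given job will hang.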

Re: TaskManager deadlock on NetworkBufferPool

2018-04-06, Nico Kruber
I'm not aware of any changes regarding the blocking buffer pools, though. Is it really stuck, or just making progress slowly? (You can check the number of records sent/received in the Web UI.) Anyway, this may also simply mean that the task is back-pressured, depending on how the operators are w
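Nico's stuck-versus-slow check amounts to comparing record counters taken some time apart. A minimal sketch, assuming the per-task "records received" values have been copied from the Web UI into a map keyed by a hypothetical task name (this is not a Flink API): if no counter moved between two snapshots the job looks stalled; any increase means it is progressing, possibly slowly because of back-pressure.

```java
import java.util.Map;

/** Hypothetical helper: classifies two snapshots of per-task record counters. */
final class ProgressCheck {

    enum Status { PROGRESSING, STALLED }

    static Status classify(Map<String, Long> before, Map<String, Long> after) {
        for (Map.Entry<String, Long> entry : after.entrySet()) {
            long earlier = before.getOrDefault(entry.getKey(), 0L);
            // Any counter that grew means records are still flowing, even if slowly.
            if (entry.getValue() > earlier) {
                return Status.PROGRESSING;
            }
        }
        // No counter moved between the two snapshots: the job looks stuck.
        return Status.STALLED;
    }

    public static void main(String[] args) {
        // Made-up example values, not measurements from this thread.
        Map<String, Long> t0 = Map.of("Join", 1_000L, "Sink", 900L);
        Map<String, Long> t1 = Map.of("Join", 1_000L, "Sink", 900L);
        Map<String, Long> t2 = Map.of("Join", 1_050L, "Sink", 900L);
        System.out.println(classify(t0, t1)); // STALLED
        System.out.println(classify(t0, t2)); // PROGRESSING
    }
}
```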

Re: TaskManager deadlock on NetworkBufferPool

2018-04-04, Fabian Hueske
Hi Amit, The network stack has been redesigned for the upcoming Flink 1.5 release, which might have fixed the issue. A first release candidate for Flink 1.5.0 is already available [1]. It would be great if you could check whether the bug is still present. Best, Fabian

Re: TaskManager deadlock on NetworkBufferPool

2018-04-04, Ted Yu
I searched for 0x0005e28fe218 in the two files you attached to FLINK-2685 but didn't find any hits. Was this the same instance as the one in the attachment to FLINK-2685? Thanks
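Ted's check, searching the attached thread dumps for the monitor address 0x0005e28fe218, can be scripted. The following is a hypothetical helper (not part of Flink or the JDK) that assumes jstack-style dump text, where each thread section starts with a quoted thread name and lock lines look like `- locked <addr>`, `- waiting to lock <addr>` or `- parking to wait for <addr>`. It lists every line that mentions the address together with the owning thread, so holders and waiters of the same monitor show up side by side.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds which threads in a jstack-style dump mention a lock address. */
final class LockScan {

    static List<String> find(String dump, String address) {
        List<String> hits = new ArrayList<>();
        String currentThread = "?";
        String needle = "<" + address + ">";
        for (String line : dump.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.startsWith("\"")) {
                // Thread header line: the thread name is the first quoted token.
                int end = trimmed.indexOf('"', 1);
                currentThread = end > 0 ? trimmed.substring(1, end) : trimmed;
            } else if (trimmed.contains(needle)) {
                // Lock line (locked / waiting to lock / parking) that names the address.
                hits.add(currentThread + ": " + trimmed);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up excerpt, not taken from the dumps attached to FLINK-2685.
        String dump = String.join("\n",
                "\"worker-1\" #21 prio=5 os_prio=0 tid=0x01 nid=0x1 waiting for monitor entry",
                "   java.lang.Thread.State: BLOCKED (on object monitor)",
                "\t- waiting to lock <0x0005e28fe218> (a java.lang.Object)",
                "\"worker-2\" #22 prio=5 os_prio=0 tid=0x02 nid=0x2 in Object.wait()",
                "\t- locked <0x0005e28fe218> (a java.lang.Object)");
        find(dump, "0x0005e28fe218").forEach(System.out::println);
    }
}
```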

Re: TaskManager deadlock on NetworkBufferPool

2018-04-04, Amit Jain
+u...@flink.apache.org
On Wed, Apr 4, 2018 at 11:33 AM, Amit Jain wrote:
> Hi,
>
> We are hitting the "TaskManager deadlock on NetworkBufferPool" bug in Flink 1.3.2.
> We have a set of ETL merge jobs for a number of tables, and they get stuck on the above issue at random times every day.
>
> I'm attaching the thread dump o