Re: Heartbeat of TaskManager timed out.

2020-07-07 Thread Ori Popowski
I wouldn't want to jump to conclusions, but from what I see, very large lists and vectors do not work well with flatten in 2.11, each for its own reasons. In any case, it's 100% not a Flink issue. On Tue, Jul 7, 2020 at 10:10 AM Xintong Song wrote: > Thanks for the updates, Ori. > > I'm not f

Re: Heartbeat of TaskManager timed out.

2020-07-07 Thread Xintong Song
Thanks for the updates, Ori. I'm not familiar with Scala. Just curious, if what you suspect is true, is it a bug in Scala? Thank you~ Xintong Song On Tue, Jul 7, 2020 at 1:41 PM Ori Popowski wrote: > Hi, > > I just wanted to update that the problem is now solved! > > I suspect that Scala's

Re: Heartbeat of TaskManager timed out.

2020-07-06 Thread Ori Popowski
Hi, I just wanted to update that the problem is now solved! I suspect that Scala's flatten() method has a memory problem on very large lists (> 2 billion elements). When using Scala Lists, the memory seems to leak but the app keeps running, and when using Scala Vectors, a weird IllegalArgumentExc

Re: Heartbeat of TaskManager timed out.

2020-07-02 Thread Xintong Song
I agree with Roman's suggestion for increasing heap size. It seems that the heap grows faster than it is freed. Thus eventually the Full GC is triggered, taking more than 50s and causing the timeout. However, even the full GC frees only 2 GB of space out of the 28 GB max size. That probably suggests that the

Re: Heartbeat of TaskManager timed out.

2020-07-02 Thread Ori Popowski
Thank you very much for your analysis. When I said there was no memory leak, I meant that for the specific TaskManager I monitored in real time using JProfiler. Unfortunately, this problem occurs in only one of the TaskManagers and you cannot anticipate which. So when you pick a TM to profile at ra

Re: Heartbeat of TaskManager timed out.

2020-07-02 Thread Khachatryan Roman
Thanks, Ori. From the log, it looks like there IS a memory leak. At 10:12:53 there was the last "successful" GC, when 13 GB was freed in 0.4653809 secs: [Eden: 17336.0M(17336.0M)->0.0B(2544.0M) Survivors: 40960.0K->2176.0M Heap: 23280.3M(28960.0M)->10047.0M(28960.0M)] Then the heap grew from 10G to 2
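
For readers following the arithmetic in the message above, here is a minimal Python sketch that reads the `Heap:` portion of the quoted G1 log line and computes how much heap that collection freed. The function names (`to_mb`, `heap_summary`) and the regular expression are illustrative assumptions, not part of Flink or the JVM tooling, and the pattern is only written against the single log line quoted in this thread.

```python
import re

# Size suffixes that appear in the quoted G1 log line, converted to megabytes.
_UNIT_TO_MB = {"B": 1 / (1024 * 1024), "K": 1 / 1024, "M": 1.0, "G": 1024.0}

# Matches "Heap: <used>(<capacity>)-><used>(<capacity>)".
# Assumption: this layout is taken from the one line quoted in the thread.
_HEAP_RE = re.compile(
    r"Heap:\s*([\d.]+)([BKMG])\([\d.]+[BKMG]\)->([\d.]+)([BKMG])\(([\d.]+)([BKMG])\)"
)


def to_mb(value, unit):
    """Convert a size such as ('40960.0', 'K') to megabytes."""
    return float(value) * _UNIT_TO_MB[unit]


def heap_summary(line):
    """Return heap usage before/after a collection, or None if no Heap entry is found."""
    match = _HEAP_RE.search(line)
    if match is None:
        return None
    before = to_mb(match.group(1), match.group(2))
    after = to_mb(match.group(3), match.group(4))
    capacity = to_mb(match.group(5), match.group(6))
    return {
        "before_mb": before,
        "after_mb": after,
        "freed_mb": before - after,
        "occupancy_after": after / capacity,
    }


# The log line quoted in the message above.
LINE = (
    "[Eden: 17336.0M(17336.0M)->0.0B(2544.0M) Survivors: 40960.0K->2176.0M "
    "Heap: 23280.3M(28960.0M)->10047.0M(28960.0M)]"
)

summary = heap_summary(LINE)
print(summary)
```

On the quoted line, the freed amount comes out at about 13,233 MB, which is consistent with the roughly 13 GB mentioned in the message. Running the same function over successive GC entries would show whether the post-collection heap keeps climbing, which is the pattern the message describes as a leak.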

Re: Heartbeat of TaskManager timed out.

2020-07-01 Thread Xintong Song
Maybe you can share the log and gc-log of the problematic TaskManager? See if we can find any clue. Thank you~ Xintong Song On Wed, Jul 1, 2020 at 8:11 PM Ori Popowski wrote: > I've found out that sometimes one of my TaskManagers experiences a GC > pause of 40-50 seconds and I have no idea w

Re: Heartbeat of TaskManager timed out.

2020-07-01 Thread Ori Popowski
I've found out that sometimes one of my TaskManagers experiences a GC pause of 40-50 seconds and I have no idea why. I profiled one of the machines using JProfiler and everything looks fine. No memory leaks, memory is low. However, I cannot anticipate which of the machines will get the 40-50 second

Re: Heartbeat of TaskManager timed out.

2020-06-28 Thread Xintong Song
In Flink 1.10, there's a huge change in the memory management compared to previous versions. This could be related to your observations, because with the same configurations, it is possible that there's less JVM heap space (with more off-heap memory). Please take a look at this migration guide [1].

Re: Heartbeat of TaskManager timed out.

2020-06-28 Thread Ori Popowski
Thanks for the suggestions! > i recently tried 1.10 and see this error frequently. and i dont have the same issue when running with 1.9.1 I did downgrade to Flink 1.9 and there's certainly no change in the occurrences of the heartbeat timeout > - Probably the most straightforward way is to t

Re: Heartbeat of TaskManager timed out.

2020-06-27 Thread Xintong Song
Hi Ori, Here are some suggestions from my side. - Probably the most straightforward way is to try increasing the timeout to see if that helps. You can leverage the configuration option `heartbeat.timeout`[1]. The default is 50s. - It might be helpful to share your configuration setups