Re: Flink and swapping question

2017-06-07 Thread Flavio Pompermaier
I forgot to mention that my jobs are all batch (at the moment). Do you think that this problem could be related to http://www.evanjones.ca/java-bytebuffer-leak.html#comment-3240054880 and http://www.evanjones.ca/java-native-leak-bug.html? Kurt also told me to add "env.java.opts: -Dio.ne

Re: Flink and swapping question

2017-06-06 Thread Flavio Pompermaier
Hi Stephan, I also think that the error is more related to Netty. The only suspicious libraries I use are Parquet and Thrift. I'm not using off-heap memory. What do you mean by "crazy high number of concurrent network shuffles"? How can I count that? We're using Java 8. Thanks a lot, Flavio On 6 J

Re: Flink and swapping question

2017-06-06 Thread Stephan Ewen
Hi! I would actually be surprised if this is an issue in core Flink. - The MaxDirectMemory parameter is pretty meaningless; it really is a max and does not have an impact on how much is actually allocated. - In most cases reported so far, the leak was in a library that was used in the

Re: Flink and swapping question

2017-06-06 Thread Fabian Hueske
Hi Flavio, can you post all the memory configuration parameters of your workers? Did you investigate whether the direct or heap memory grew? Thanks, Fabian 2017-05-29 20:53 GMT+02:00 Flavio Pompermaier : > Hi to all, > I'm still trying to understand what's going on in our production Flink >

Re: Flink and swapping question

2017-05-29 Thread Flavio Pompermaier
Hi to all, I'm still trying to understand what's going on in our production Flink cluster. The facts are: 1. The Flink cluster runs on 5 VMWare VMs managed by ESXi 2. On a specific job we have, without limiting the direct memory to 5g, the TM gets killed by the OS almost immediately because the memo

Re: Flink and swapping question

2017-05-29 Thread Nico Kruber
FYI: taskmanager.sh sets this parameter but also states the following: # Long.MAX_VALUE in TB: This is an upper bound, much less direct memory will be used TM_MAX_OFFHEAP_SIZE="8388607T" Nico On Monday, 29 May 2017 15:19:47 CEST Aljoscha Krettek wrote: > Hi Flavio, > > Is this running on
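The "Long.MAX_VALUE in TB" comment quoted above can be checked with a little arithmetic. The sketch below assumes that the "T" suffix stands for 2^40 bytes; the constant names are made up for illustration and do not come from taskmanager.sh.

```python
# Largest signed 64-bit integer, i.e. the value of Java's Long.MAX_VALUE.
LONG_MAX_VALUE = 2**63 - 1

# Assumption: the "T" suffix means tebibytes (2**40 bytes).
BYTES_PER_TB = 2**40

# Integer division gives the largest whole number of TB that still fits.
max_offheap_tb = LONG_MAX_VALUE // BYTES_PER_TB
print(f"{max_offheap_tb}T")
```

Under that assumption the result matches the 8388607T value quoted in the message, which fits the reading that the setting is only an upper bound and not an amount that is actually reserved.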

Re: Flink and swapping question

2017-05-29 Thread Aljoscha Krettek
Hi Flavio, Is this running on YARN or bare metal? Did you manage to find out where this insanely large parameter is coming from? Best, Aljoscha > On 25. May 2017, at 19:36, Flavio Pompermaier wrote: > > Hi to all, > I think we found the root cause of all the problems. Looking at dmesg there

Re: Flink and swapping question

2017-05-25 Thread Flavio Pompermaier
Hi to all, I think we found the root cause of all the problems. Looking at dmesg there was a "crazy" total-vm size associated with the OOM error, a LOT bigger than the TaskManager's available memory. In our case, the TM had a max heap of 14 GB while the dmesg error was reporting a required amount

Re: Flink and swapping question

2017-05-25 Thread Flavio Pompermaier
I can confirm that after giving less memory to the Flink TM the job was able to run successfully. After almost 2 weeks of pain, we summarize here our experience with Flink in virtualized environments (such as VMWare ESXi): 1. Disable the virtualization "feature" that transfers a VM from a (heavy

Re: Flink and swapping question

2017-05-24 Thread Flavio Pompermaier
Hi Greg, you were right! After typing dmesg I found "Out of memory: Kill process 13574 (java)". This is really strange because the JVM of the TM is very calm. Moreover, there are 7 GB of memory available (out of 32) but somehow the OS decides to start swapping and, when it runs out of available swap
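A minimal sketch of how the dmesg check described above could be automated. It assumes the kernel log has already been captured as text (for example, the saved output of dmesg) and matches only the exact wording quoted in the message; other kernel versions may phrase the line differently. The function name find_oom_kills is hypothetical.

```python
import re

# Matches the wording quoted in the thread. This pattern is an assumption
# about the log format and may need adjusting for other kernel versions.
OOM_KILL = re.compile(r"Out of memory: Kill process (\d+) \(([^)]+)\)")

def find_oom_kills(dmesg_text):
    """Return a (pid, process name) pair for every OOM-killer line found."""
    return [(int(pid), name) for pid, name in OOM_KILL.findall(dmesg_text)]

# The line reported in the thread, used here as sample input.
sample = "Out of memory: Kill process 13574 (java)"
print(find_oom_kills(sample))
```

Comparing the reported PID with the TaskManager's PID would show whether the process that died silently was the one the kernel killed.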

Re: Flink and swapping question

2017-05-24 Thread Flavio Pompermaier
Hi Greg, I carefully monitored all TM memory with jstat -gcutil and there's no full GC, only . The initial situation on the dying TM is:
     S0     S1      E      O      M    CCS    YGC   YGCT    FGC   FGCT    GCT
   0.00 100.00  33.57  88.74  98.42  97.17    159  2.508      1  0.255  2.763
   0.00 1
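The first row of the jstat -gcutil output quoted above can be read programmatically. The sketch below assumes the flattened text splits into the columns S0, S1, E, O, M, CCS, YGC, YGCT, FGC, FGCT and GCT; that split is a reconstruction, supported by the fact that YGCT plus FGCT then equals GCT.

```python
# Assumed column split of the flattened jstat -gcutil output.
header = "S0 S1 E O M CCS YGC YGCT FGC FGCT GCT".split()
row = "0.00 100.00 33.57 88.74 98.42 97.17 159 2.508 1 0.255 2.763".split()

# Map each column name to its numeric value.
stats = dict(zip(header, map(float, row)))

# Sanity check on the reconstruction: total GC time should equal
# young GC time plus full GC time.
consistent = abs(stats["YGCT"] + stats["FGCT"] - stats["GCT"]) < 1e-3

print(f"old generation {stats['O']}% used, GC times consistent: {consistent}")
```

A few seconds of total GC time at this point supports the observation that the JVM heap itself is not under heavy collection pressure.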

Re: Flink and swapping question

2017-05-24 Thread Greg Hogan
Hi Flavio, Flink handles interrupts so the only silent killer I am aware of is Linux's OOM killer. Are you seeing such a message in dmesg? Greg On Wed, May 24, 2017 at 3:18 AM, Flavio Pompermaier wrote: > Hi to all, > I'd like to know whether memory swapping could cause a taskmanager crash. >

Flink and swapping question

2017-05-24 Thread Flavio Pompermaier
Hi to all, I'd like to know whether memory swapping could cause a taskmanager crash. In my cluster of virtual machines I'm seeing this strange behavior in my Flink cluster: sometimes, if memory gets swapped the taskmanager (on that machine) dies unexpectedly without any log about the error. Is that