Hi Flavio,
can you post all the memory configuration parameters of your workers? Did you investigate whether the direct or the heap memory grew?

Thanks,
Fabian
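(For anyone following along: the parameters Fabian is asking about live in flink-conf.yaml. A sketch of the memory-related keys on a Flink 1.2/1.3-era setup -- the values below are illustrative placeholders, not Flavio's actual settings:

    # flink-conf.yaml -- memory-related keys (illustrative values only)
    taskmanager.heap.mb: 14336                 # TM JVM heap (-Xmx), here 14 GB
    taskmanager.memory.off-heap: false         # managed memory on the heap vs. in direct memory
    taskmanager.memory.fraction: 0.7           # share of free memory given to Flink's managed memory
    taskmanager.memory.preallocate: false      # allocate managed memory eagerly at startup?
    taskmanager.network.numberOfBuffers: 2048  # network buffers, one 32 KB memory segment each by default

Comparing these settings against the resident size the OS reports for the TM process is what distinguishes a heap problem from a direct-memory one.)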
2017-05-29 20:53 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:

> Hi to all,
> I'm still trying to understand what's going on in our production Flink cluster.
> The facts are:
>
> 1. The Flink cluster runs on 5 VMware VMs managed by ESXi.
> 2. For one specific job we have, if we don't limit the direct memory to 5 GB, the TM gets killed by the OS almost immediately because the memory required by the TM at some point becomes huge, like > 100 GB (other jobs seem to be less affected by the problem).
> 3. Although the memory consumption is much better with that limit, the Flink TM memory continuously grows job after job (of this problematic type): we set the TM max heap to 14 GB and the memory required by the JVM can reach ~30 GB. How is that possible?
>
> My fear is that there's some annoying memory leak / bad memory allocation at the Flink network level, but I have no evidence of this (except the fact that the VM which doesn't have an HDFS datanode underneath the Flink TM is the one with the biggest TM virtual memory consumption).
>
> Thanks for the help,
> Flavio
>
> On 29 May 2017 15:37, "Nico Kruber" <n...@data-artisans.com> wrote:
>
>> FYI: taskmanager.sh sets this parameter but also states the following:
>>
>> # Long.MAX_VALUE in TB: This is an upper bound, much less direct memory will be used
>> TM_MAX_OFFHEAP_SIZE="8388607T"
>>
>> Nico
>>
>> On Monday, 29 May 2017 15:19:47 CEST Aljoscha Krettek wrote:
>>
>>> Hi Flavio,
>>>
>>> Is this running on YARN or bare metal? Did you manage to find out where this insanely large parameter is coming from?
>>>
>>> Best,
>>> Aljoscha
>>>
>>> On 25. May 2017, at 19:36, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>
>>>> Hi to all,
>>>> I think we found the root cause of all the problems. Looking at dmesg, there was a "crazy" total-vm size associated with the OOM error, a LOT bigger than the TaskManager's available memory. In our case the TM had a max heap of 14 GB, while the dmesg error was reporting a required amount of memory in the order of 60 GB!
>>>>
>>>> [ 5331.992539] Out of memory: Kill process 24221 (java) score 937 or sacrifice child
>>>> [ 5331.992619] Killed process 24221 (java) total-vm:64800680kB, anon-rss:31387544kB, file-rss:6064kB, shmem-rss:0kB
>>>>
>>>> That shouldn't have been possible with an ordinary JVM (and our TM was running without off-heap settings), so we looked at the parameters used to run the TM JVM, and indeed there was a really huge amount of memory given to MaxDirectMemorySize. To my big surprise, Flink runs a TM with this parameter set to 8388607T (8,388,607 terabytes).. does that make any sense?? Is the importance of this parameter documented anywhere (and why is it used in non-off-heap mode as well)? Is it related to network buffers? It should also be documented that this parameter should be added to the TM heap when reserving memory for Flink (IMHO).
>>>>
>>>> I hope that these painful sessions of Flink troubleshooting will prove to be of added value sooner or later..
>>>>
>>>> Best,
>>>> Flavio
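(To make the 8388607T figure concrete: it comes straight from the launch script Nico quotes. A simplified sketch of the relevant lines of bin/taskmanager.sh from that era -- paraphrased, variable names approximate, not a verbatim copy:

    # bin/taskmanager.sh (simplified sketch, variable names approximate)
    # Long.MAX_VALUE in TB: This is an upper bound, much less direct memory will be used
    TM_MAX_OFFHEAP_SIZE="8388607T"

    # the bound is passed to the TM JVM even when off-heap mode is disabled:
    export JVM_ARGS="${JVM_ARGS} -Xms${TM_HEAP_SIZE}M -Xmx${TM_HEAP_SIZE}M \
        -XX:MaxDirectMemorySize=${TM_MAX_OFFHEAP_SIZE}"

Note that -XX:MaxDirectMemorySize is only a ceiling on NIO direct-buffer allocations, not a reservation: setting it absurdly high allocates nothing by itself, but it also means nothing stops direct allocations (network buffers, anything calling ByteBuffer.allocateDirect) from growing far beyond the configured heap, which is consistent with a 14 GB heap showing up as a ~60 GB total-vm in the OOM killer's log.)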
>>>> On Thu, May 25, 2017 at 10:21 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>>
>>>> I can confirm that after giving less memory to the Flink TM the job was able to run successfully.
>>>> After almost 2 weeks of pain, here is a summary of our experience with Flink in virtualized environments (such as VMware ESXi):
>>>>
>>>> 1. Disable the virtualization "feature" that transfers a VM from a (heavily loaded) physical machine to another one to balance resource consumption.
>>>> 2. Check dmesg when a TM dies without logging anything: usually it goes OOM and the OS kills it, and that's where you find the trace of it.
>>>> 3. CentOS 7 on ESXi seems to start swapping VERY early (in my case the OS starts swapping even with 12 out of 32 GB of memory free)! We're still investigating how this behavior could be fixed: the problem is that it's better not to disable swapping, because otherwise VMware could start ballooning (which is definitely worse...).
>>>>
>>>> I hope these tips can save someone else's day..
>>>>
>>>> Best,
>>>> Flavio
>>>>
>>>> On Wed, May 24, 2017 at 4:28 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>>
>>>> Hi Greg,
>>>> you were right! After typing dmesg I found "Out of memory: Kill process 13574 (java)". This is really strange because the JVM of the TM is very calm.
>>>> Moreover, there are 7 GB of memory available (out of 32), but somehow the OS decides to start swapping and, when it runs out of available swap memory, it kills the Flink TM :(
>>>>
>>>> Any idea of what's going on here?
>>>>
>>>> On Wed, May 24, 2017 at 2:32 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>>
>>>> Hi Greg,
>>>> I carefully monitored all TM memory with jstat -gcutil and there's no full GC, only young GCs.
>>>> The initial situation on the dying TM is:
>>>>
>>>>     S0     S1      E      O      M      CCS    YGC   YGCT    FGC  FGCT   GCT
>>>>     0.00   100.00  33.57  88.74  98.42  97.17   159   2.508    1  0.255   2.763
>>>>     0.00   100.00  90.14  88.80  98.67  97.17   197   2.617    1  0.255   2.873
>>>>     0.00   100.00  27.00  88.82  98.75  97.17   234   2.730    1  0.255   2.986
>>>>
>>>> After about 10 hours of processing it is:
>>>>
>>>>     S0     S1      E      O      M      CCS    YGC   YGCT    FGC  FGCT   GCT
>>>>     0.00   100.00  21.74  83.66  98.52  96.94  5519  33.011    1  0.255  33.267
>>>>     0.00   100.00  21.74  83.66  98.52  96.94  5519  33.011    1  0.255  33.267
>>>>     0.00   100.00  21.74  83.66  98.52  96.94  5519  33.011    1  0.255  33.267
>>>>
>>>> So I don't think that OOM could be an option.
>>>>
>>>> However, the cluster is running on ESXi vSphere VMs, and we have already experienced unexpected job crashes because of ESXi moving a heavily loaded VM to another (less loaded) physical machine.. I wouldn't be surprised if swapping is also handled somehow differently.. Looking at the Cloudera widgets I see that the crash is usually preceded by an intense cpu_iowait period. I fear that Flink's unsafe memory access could be a problem in those scenarios. Am I wrong?
>>>>
>>>> Any insight or debugging technique is greatly appreciated.
>>>>
>>>> Best,
>>>> Flavio
>>>>
>>>> On Wed, May 24, 2017 at 2:11 PM, Greg Hogan <c...@greghogan.com> wrote:
>>>>
>>>> Hi Flavio,
>>>>
>>>> Flink handles interrupts so the only silent killer I am aware of is Linux's OOM killer. Are you seeing such a message in dmesg?
>>>>
>>>> Greg
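(Greg's dmesg check can be done with standard Linux commands; a generic sketch, nothing Flink-specific:

    # Did the kernel's OOM killer take down the TaskManager JVM?
    dmesg | grep -iE "out of memory|killed process"

    # on systemd-based systems such as CentOS 7, the kernel log is also in the journal:
    journalctl -k | grep -i "killed process"

If the TM was the victim, you will see a "Killed process ... total-vm:..." line such as the one Flavio quotes above in his May 25 mail.)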
>>>> On Wed, May 24, 2017 at 3:18 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>>
>>>> Hi to all,
>>>> I'd like to know whether memory swapping could cause a taskmanager crash.
>>>> In my Flink cluster of virtual machines I'm seeing this strange behavior: sometimes, if memory gets swapped, the taskmanager (on that machine) dies unexpectedly without any log about the error.
>>>>
>>>> Is that possible or not?
>>>>
>>>> Best,
>>>> Flavio
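(A footnote on the early-swapping behavior described in Flavio's May 25 summary: a mitigation commonly suggested for this situation -- not something tested in this thread -- is to lower vm.swappiness instead of disabling swap outright, since disabling swap can push VMware into ballooning:

    # check the kernel's tendency to swap (kernel default: 60)
    sysctl vm.swappiness

    # swap only under real memory pressure
    sudo sysctl -w vm.swappiness=1

    # persist the setting across reboots
    echo 'vm.swappiness=1' | sudo tee /etc/sysctl.d/99-swappiness.conf

This addresses the symptom only; the root cause found above, the effectively unbounded MaxDirectMemorySize, still needs to be accounted for when sizing the TM against physical memory.)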