I can confirm that after giving less memory to the Flink TM the job was able to run successfully. After almost 2 weeks of pain, here is a summary of our experience with Flink in virtualized environments (such as VMWare ESXi):
1. Disable the virtualization "feature" that transfers a VM from a (heavily loaded) physical machine to another one (to balance resource consumption).
2. Check dmesg when a TM dies without logging anything (usually it runs out of memory and the OS kills it; the kernel log is where you can find a record of this).
3. CentOS 7 on ESXi seems to start swapping VERY early (in my case I see the OS start swapping even when 12 out of 32 GB of memory are free)! We're still investigating how this behavior could be fixed: the problem is that it's better not to disable swapping, because otherwise VMWare could start ballooning (which is definitely worse...).

I hope these tips can save someone else's day.

Best,
Flavio

On Wed, May 24, 2017 at 4:28 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:

> Hi Greg, you were right! After typing dmesg I found "Out of memory: Kill
> process 13574 (java)".
> This is really strange because the JVM of the TM is very calm.
> Moreover, there are 7 GB of memory available (out of 32) but somehow the
> OS decides to start swapping and, when it runs out of available swap
> memory, the OS decides to kill the Flink TM :(
>
> Any idea of what's going on here?
>
> On Wed, May 24, 2017 at 2:32 PM, Flavio Pompermaier <pomperma...@okkam.it>
> wrote:
>
>> Hi Greg,
>> I carefully monitored all TM memory with jstat -gcutil and there's no full
>> gc, only .
>> The initial situation on the dying TM is:
>>
>>   S0      S1      E      O      M    CCS   YGC    YGCT  FGC   FGCT     GCT
>> 0.00  100.00  33.57  88.74  98.42  97.17   159   2.508    1  0.255   2.763
>> 0.00  100.00  90.14  88.80  98.67  97.17   197   2.617    1  0.255   2.873
>> 0.00  100.00  27.00  88.82  98.75  97.17   234   2.730    1  0.255   2.986
>>
>> After about 10 hours of processing it is:
>>
>> 0.00  100.00  21.74  83.66  98.52  96.94  5519  33.011    1  0.255  33.267
>> 0.00  100.00  21.74  83.66  98.52  96.94  5519  33.011    1  0.255  33.267
>> 0.00  100.00  21.74  83.66  98.52  96.94  5519  33.011    1  0.255  33.267
>>
>> So I don't think that OOM could be an option.
>>
>> However, the cluster is running on ESXi vSphere VMs and we have already
>> experienced unexpected job crashes because of ESXi moving a heavily loaded
>> VM to another (less loaded) physical machine. I wouldn't be surprised if
>> swapping is also handled somehow differently.
>> Looking at Cloudera widgets I see that the crash is usually preceded by
>> an intense cpu_iowait period.
>> I fear that Flink's unsafe access to memory could be a problem in those
>> scenarios. Am I wrong?
>>
>> Any insight or debugging technique is greatly appreciated.
>> Best,
>> Flavio
>>
>>
>> On Wed, May 24, 2017 at 2:11 PM, Greg Hogan <c...@greghogan.com> wrote:
>>
>>> Hi Flavio,
>>>
>>> Flink handles interrupts so the only silent killer I am aware of is
>>> Linux's OOM killer. Are you seeing such a message in dmesg?
>>>
>>> Greg
>>>
>>> On Wed, May 24, 2017 at 3:18 AM, Flavio Pompermaier <
>>> pomperma...@okkam.it> wrote:
>>>
>>>> Hi to all,
>>>> I'd like to know whether memory swapping could cause a taskmanager
>>>> crash.
>>>> In my cluster of virtual machines I'm seeing this strange behavior in my
>>>> Flink cluster: sometimes, if memory gets swapped, the taskmanager (on that
>>>> machine) dies unexpectedly without any log about the error.
>>>>
>>>> Is that possible or not?
>>>>
>>>> Best,
>>>> Flavio
>>>>
>>>
>>>
>
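
P.S. For anyone who wants to automate tip 2 (checking dmesg after a TM dies silently), here is a minimal Python sketch. It only assumes the kernel message looks like the one quoted in this thread ("Out of memory: Kill process 13574 (java)"); the "Killed process" variant in the pattern is an assumption about other kernel versions, and the function names find_oom_kills and read_dmesg are made up for this example, not part of Flink or any library.

```python
import re
import subprocess

# Matches the message seen in this thread, e.g.
#   "Out of memory: Kill process 13574 (java)"
# The optional "ed" is an assumption to also cover kernels that log
# "Killed process" instead of "Kill process".
OOM_PATTERN = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(dmesg_text):
    """Return a list of (pid, process_name) for each OOM-killer line found."""
    return [(int(m.group(1)), m.group(2)) for m in OOM_PATTERN.finditer(dmesg_text)]


def read_dmesg():
    """Return the kernel ring buffer as text (may require root on some systems)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout
```

On the TM host, calling find_oom_kills(read_dmesg()) and looking for entries whose process name is "java" should tell you whether the OOM killer was responsible.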