Hi Greg,
I carefully monitored all TM memory with jstat -gcutil and there's no full GC activity, only young collections. The initial situation on the dying TM is:
    S0     S1      E      O      M    CCS    YGC    YGCT  FGC   FGCT     GCT
  0.00 100.00  33.57  88.74  98.42  97.17    159   2.508    1  0.255   2.763
  0.00 100.00  90.14  88.80  98.67  97.17    197   2.617    1  0.255   2.873
  0.00 100.00  27.00  88.82  98.75  97.17    234   2.730    1  0.255   2.986

After about 10 hours of processing it is:

  0.00 100.00  21.74  83.66  98.52  96.94   5519  33.011    1  0.255  33.267
  0.00 100.00  21.74  83.66  98.52  96.94   5519  33.011    1  0.255  33.267
  0.00 100.00  21.74  83.66  98.52  96.94   5519  33.011    1  0.255  33.267

So I don't think that an OOM is a likely explanation. However, the cluster is running on ESXi vSphere VMs, and we have already experienced unexpected job crashes because ESXi moved a heavily loaded VM to another (less loaded) physical machine. I wouldn't be surprised if swapping is also handled somewhat differently.

Looking at the Cloudera widgets, I see that the crash is usually preceded by a period of intense cpu_iowait. I fear that Flink's unsafe access to memory could be a problem in those scenarios. Am I wrong?

Any insight or debugging technique is greatly appreciated.

Best,
Flavio

On Wed, May 24, 2017 at 2:11 PM, Greg Hogan <c...@greghogan.com> wrote:

> Hi Flavio,
>
> Flink handles interrupts, so the only silent killer I am aware of is
> Linux's OOM killer. Are you seeing such a message in dmesg?
>
> Greg
>
> On Wed, May 24, 2017 at 3:18 AM, Flavio Pompermaier <pomperma...@okkam.it>
> wrote:
>
>> Hi to all,
>> I'd like to know whether memory swapping could cause a taskmanager crash.
>> In my cluster of virtual machines I'm seeing this strange behavior in my
>> Flink cluster: sometimes, if memory gets swapped, the taskmanager (on that
>> machine) dies unexpectedly without any log about the error.
>>
>> Is that possible or not?
>>
>> Best,
>> Flavio
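
P.S. In case it helps anyone repeat the jstat comparison above, here is a minimal sketch of how the first and last samples could be compared automatically. It assumes the 11-column `jstat -gcutil` layout shown in this thread (newer JDKs may print extra columns, which this sketch would simply skip), and the function names `parse_gcutil` and `summarize` are hypothetical helpers of mine, not part of any tool.

```python
from typing import Dict, List

# Column layout as printed by the jstat -gcutil output quoted in this thread.
# Assumption: other JDK versions may add columns; such rows are ignored here.
COLUMNS = ["S0", "S1", "E", "O", "M", "CCS", "YGC", "YGCT", "FGC", "FGCT", "GCT"]


def parse_gcutil(text: str) -> List[Dict[str, float]]:
    """Parse jstat -gcutil output into one dict per numeric sample row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue  # blank line or unexpected layout
        try:
            values = [float(field) for field in fields]
        except ValueError:
            continue  # header line or other non-numeric text
        rows.append(dict(zip(COLUMNS, values)))
    return rows


def summarize(first: Dict[str, float], last: Dict[str, float]) -> Dict[str, float]:
    """Report how GC counts and old-gen occupancy changed between two samples."""
    return {
        "young_gcs": int(last["YGC"] - first["YGC"]),
        "full_gcs": int(last["FGC"] - first["FGC"]),
        "old_gen_delta": round(last["O"] - first["O"], 2),
    }
```

Feeding it the captured jstat text and calling `summarize(rows[0], rows[-1])` gives the number of young and full collections between the two samples. This only covers JVM heap behaviour; it says nothing about the Linux OOM killer or off-heap memory, which would still need to be checked in dmesg.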