Hi Greg,
I carefully monitored the memory of all TMs with jstat -gcutil and there's no
full GC, only young-generation collections.
The initial situation on the dying TM is:

  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00 100.00  33.57  88.74  98.42  97.17    159    2.508     1    0.255    2.763
  0.00 100.00  90.14  88.80  98.67  97.17    197    2.617     1    0.255    2.873
  0.00 100.00  27.00  88.82  98.75  97.17    234    2.730     1    0.255    2.986

After about 10 hours of processing it is:

  0.00 100.00  21.74  83.66  98.52  96.94   5519   33.011     1    0.255   33.267
  0.00 100.00  21.74  83.66  98.52  96.94   5519   33.011     1    0.255   33.267
  0.00 100.00  21.74  83.66  98.52  96.94   5519   33.011     1    0.255   33.267

So I don't think that OOM could be an option.
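
For anyone who wants to repeat this comparison on their own TMs, here is a minimal sketch of how the first and last jstat -gcutil samples could be compared automatically. It assumes the 11-column layout shown above (newer JDKs may print extra columns), and the names parse_gcutil and summarize are hypothetical helpers, not part of any tool.

```python
from typing import Dict, List

# Column layout of `jstat -gcutil` as printed above (assumed; newer JDKs may differ).
COLUMNS = ["S0", "S1", "E", "O", "M", "CCS", "YGC", "YGCT", "FGC", "FGCT", "GCT"]


def parse_gcutil(text: str) -> List[Dict[str, float]]:
    """Turn the data rows of jstat -gcutil output into dicts keyed by column name."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue  # blank or partial line
        try:
            values = [float(f) for f in fields]
        except ValueError:
            continue  # header line
        rows.append(dict(zip(COLUMNS, values)))
    return rows


def summarize(first: Dict[str, float], last: Dict[str, float]) -> Dict[str, float]:
    """Report how many collections happened and how old-gen usage moved."""
    return {
        "young_gcs": int(last["YGC"] - first["YGC"]),
        "full_gcs": int(last["FGC"] - first["FGC"]),
        "old_gen_delta_pct": round(last["O"] - first["O"], 2),
    }


# First and last samples taken from the output above.
SAMPLE = """\
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00 100.00  33.57  88.74  98.42  97.17    159    2.508     1    0.255    2.763
  0.00 100.00  21.74  83.66  98.52  96.94   5519   33.011     1    0.255   33.267
"""

rows = parse_gcutil(SAMPLE)
summary = summarize(rows[0], rows[-1])
print(summary)
```

A full-GC delta of zero together with a flat or falling old generation is what the numbers above show; it says nothing about memory used outside the Java heap.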

However, the cluster is running on ESXi vSphere VMs and we have already
experienced unexpected job crashes caused by ESXi moving a heavily loaded
VM to another (less loaded) physical machine. I wouldn't be surprised if
swapping is also handled somehow differently.
Looking at Cloudera widgets I see that the crash is usually preceded by an
intense cpu_iowait period.
I fear that Flink's unsafe memory access could be a problem in those
scenarios. Am I wrong?
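
To correlate a crash with swapping rather than guessing from the widgets, one option is to sample the kernel's swap counters around the iowait period. The sketch below assumes a Linux guest that exposes pswpin and pswpout (pages swapped in and out since boot) in /proc/vmstat; the helper names and the sample numbers are made up for illustration.

```python
from typing import Dict


def parse_vmstat(text: str) -> Dict[str, int]:
    """Parse `name value` lines in the format used by /proc/vmstat."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_delta(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Pages swapped in and out between two snapshots."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in ("pswpin", "pswpout")}


def read_vmstat(path: str = "/proc/vmstat") -> Dict[str, int]:
    """Take a live snapshot on a Linux host (not called in this demo)."""
    with open(path) as fh:
        return parse_vmstat(fh.read())


# Illustrative snapshots with invented values, not measurements from this cluster.
BEFORE = "nr_free_pages 1000\npswpin 120\npswpout 300\n"
AFTER = "nr_free_pages 200\npswpin 150\npswpout 4300\n"

delta = swap_delta(parse_vmstat(BEFORE), parse_vmstat(AFTER))
print(delta)
```

On a real TM host, two calls to read_vmstat() a few seconds apart would replace the BEFORE and AFTER strings; a large pswpout delta just before the TM disappears would support the swapping theory.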

Any insight or debugging technique is greatly appreciated.
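
Regarding the dmesg check suggested in the quoted reply below, here is a rough sketch of how the kernel log could be scanned for OOM-killer activity. The patterns are based on the usual wording of Linux OOM messages, which varies between kernel versions, so treat them as an assumption; the sample log lines are illustrative, not taken from this cluster.

```python
import re
import subprocess
from typing import List

# Typical fragments of Linux OOM-killer messages (wording differs across kernels).
OOM_PATTERN = re.compile(
    r"invoked oom-killer|Out of memory|Killed process", re.IGNORECASE
)


def find_oom_lines(log_text: str) -> List[str]:
    """Return the kernel log lines that look like OOM-killer activity."""
    return [line for line in log_text.splitlines() if OOM_PATTERN.search(line)]


def read_dmesg() -> str:
    """Capture the kernel ring buffer (may need root; not called in this demo)."""
    return subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    ).stdout


# Illustrative log excerpt with invented values.
SAMPLE_LOG = """\
[12345.678] java invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[12345.700] Out of memory: Kill process 4242 (java) score 900 or sacrifice child
[12345.701] Killed process 4242 (java)
[12400.000] eth0: link up
"""

hits = find_oom_lines(SAMPLE_LOG)
for line in hits:
    print(line)
```

Running find_oom_lines(read_dmesg()) on the host of the dead TM shortly after the crash would show whether the kernel, rather than the JVM, ended the process.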
Best,
Flavio


On Wed, May 24, 2017 at 2:11 PM, Greg Hogan <c...@greghogan.com> wrote:

> Hi Flavio,
>
> Flink handles interrupts so the only silent killer I am aware of is
> Linux's OOM killer. Are you seeing such a message in dmesg?
>
> Greg
>
> On Wed, May 24, 2017 at 3:18 AM, Flavio Pompermaier <pomperma...@okkam.it>
> wrote:
>
>> Hi to all,
>> I'd like to know whether memory swapping could cause a taskmanager crash.
>> In my cluster of virtual machines I'm seeing this strange behavior in my
>> Flink cluster: sometimes, if memory gets swapped, the taskmanager (on that
>> machine) dies unexpectedly without any log about the error.
>>
>> Is that possible or not?
>>
>> Best,
>> Flavio
>>
>
>
