Re: Flink and swapping question

Flavio Pompermaier Thu, 25 May 2017 10:37:19 -0700

Hi to all,
I think we found the root cause of all the problems. Looking at dmesg, there
was a "crazy" total-vm size associated with the OOM error, far larger
than the TaskManager's available memory.
In our case, the TM had a max heap of 14 GB, while the dmesg error was
reporting an amount of memory on the order of 60 GB!


[ 5331.992539] Out of memory: Kill process 24221 (java) score 937 or
sacrifice child
[ 5331.992619] Killed process 24221 (java) *total-vm:64800680kB*,
anon-rss:31387544kB, file-rss:6064kB, shmem-rss:0kB
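
A minimal sketch, assuming Python 3, of how the kill line above can be turned into comparable numbers. The function name parse_oom_kill and both regular expressions are illustrative choices, not part of any Flink or kernel tooling, and the pattern is only known to fit the single line quoted in this message.

```python
import re

# Kill line copied from the dmesg excerpt above, with the mail wrap joined.
KILL_LINE = (
    "[ 5331.992619] Killed process 24221 (java) total-vm:64800680kB, "
    "anon-rss:31387544kB, file-rss:6064kB, shmem-rss:0kB"
)

# Hypothetical patterns, written to match the line quoted in this thread.
KILL_RE = re.compile(r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)")
FIELD_RE = re.compile(r"(?P<key>[a-z-]+):(?P<kb>\d+)kB")


def parse_oom_kill(line):
    """Return pid, process name and memory fields in GiB, or None if no match."""
    match = KILL_RE.search(line)
    if match is None:
        return None
    # The kernel reports sizes in kB; divide by 1024 * 1024 to get GiB.
    fields = {
        f.group("key"): int(f.group("kb")) / (1024 * 1024)
        for f in FIELD_RE.finditer(line)
    }
    return {"pid": int(match.group("pid")), "name": match.group("name"), "gib": fields}


info = parse_oom_kill(KILL_LINE)
for key, gib in info["gib"].items():
    print(f"{key}: {gib:.1f} GiB")
```

Read this way, total-vm comes to roughly 61.8 GiB and anon-rss to roughly 29.9 GiB, both well above the 14 GB heap.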

That definitely wasn't possible using an ordinary JVM (and our TM was
running without off-heap settings), so we looked at the parameters used
to run the TM JVM, and indeed a really huge amount of memory was given
to MaxDirectMemorySize. To my great surprise, Flink runs the TM with this
parameter set to 8.388.607T. Does that make any sense?
Is the importance of this parameter documented anywhere (and why it is
used in non-off-heap mode as well)? Is it related to network buffers?
It should also be documented that this parameter should be added to the TM
heap when reserving memory for Flink (IMHO).
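
A rough sketch, assuming Python 3, of the arithmetic behind that suggestion. The helper names parse_jvm_size and worst_case_footprint are hypothetical, the k/m/g/t suffix handling is an assumption about how such sizes are written, and the 32 GiB figure is the VM size mentioned further down the thread. That the value acts as "no practical limit" is an inference from the numbers, not something confirmed here.

```python
def parse_jvm_size(text):
    """Convert a JVM-style size such as '14g' or '8388607T' to bytes."""
    units = {"k": 2**10, "m": 2**20, "g": 2**30, "t": 2**40}
    suffix = text[-1].lower()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)  # no suffix: plain bytes


def worst_case_footprint(heap_bytes, direct_bytes):
    """Upper bound on heap plus direct memory; ignores metaspace, stacks, etc."""
    return heap_bytes + direct_bytes


GIB = 2**30
heap = parse_jvm_size("14g")          # TM max heap mentioned above
direct = parse_jvm_size("8388607T")   # 8.388.607T written without separators
vm_ram = 32 * GIB                     # VM size quoted later in this thread

# 8388607 equals 2**23 - 1, so the cap sits just under 2**63 bytes.
print(direct == 2**63 - 2**40)
# With such a cap, heap plus direct memory is not bounded by the VM's RAM.
print(worst_case_footprint(heap, direct) <= vm_ram)
```

The first check prints True and the second prints False: with the cap that close to 2**63 bytes, only the heap setting limits the process, so direct memory has to be budgeted separately when sizing the machine.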

I hope these painful sessions of Flink troubleshooting will be of some
value sooner or later.

Best,
Flavio

On Thu, May 25, 2017 at 10:21 AM, Flavio Pompermaier <pomperma...@okkam.it>
wrote:

> I can confirm that after giving less memory to the Flink TM the job was
> able to run successfully.
> After almost 2 weeks of pain, we summarize here our experience with Flink
> in virtualized environments (such as VMware ESXi):
>
>    1. Disable the virtualization "feature" that transfers a VM from a
>    (heavily loaded) physical machine to another one (to balance resource
>    consumption)
>    2. Check dmesg when a TM dies without logging anything (usually it
>    goes OOM and the OS kills it, and dmesg is where the kill is logged)
>    3. CentOS 7 on ESXi seems to start swapping VERY early (in my case I
>    see the OS start swapping even when 12 out of 32 GB of memory are
>    free)!
>
> We're still investigating how this behavior could be fixed: the problem is
> that it's better not to disable swapping, because otherwise VMware could
> start ballooning (which is definitely worse...).
>
> I hope these tips can save someone else's day.
>
> Best,
> Flavio
>
> On Wed, May 24, 2017 at 4:28 PM, Flavio Pompermaier <pomperma...@okkam.it>
> wrote:
>
>> Hi Greg, you were right! After running dmesg I found "Out of memory: Kill
>> process 13574 (java)".
>> This is really strange because the JVM of the TM is very calm.
>> Moreover, there are 7 GB of memory available (out of 32) but somehow the
>> OS decides to start swapping and, when it runs out of available swap
>> memory, the OS decides to kill the Flink TM :(
>>
>> Any idea of what's going on here?
>>
>> On Wed, May 24, 2017 at 2:32 PM, Flavio Pompermaier <pomperma...@okkam.it>
>> wrote:
>>
>>> Hi Greg,
>>> I carefully monitored all TM memory with jstat -gcutil and there's no full
>>> GC activity, only young-generation collections (see the YGC column).
>>> The initial situation on the dying TM is:
>>>
>>>   S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
>>>   0.00 100.00  33.57  88.74  98.42  97.17    159    2.508     1    0.255    2.763
>>>   0.00 100.00  90.14  88.80  98.67  97.17    197    2.617     1    0.255    2.873
>>>   0.00 100.00  27.00  88.82  98.75  97.17    234    2.730     1    0.255    2.986
>>>
>>> After about 10 hours of processing it is:
>>>
>>>   0.00 100.00  21.74  83.66  98.52  96.94   5519   33.011     1    0.255   33.267
>>>   0.00 100.00  21.74  83.66  98.52  96.94   5519   33.011     1    0.255   33.267
>>>   0.00 100.00  21.74  83.66  98.52  96.94   5519   33.011     1    0.255   33.267
>>>
>>> So I don't think that an OOM could be the cause.
>>>
>>> However, the cluster is running on ESXi vSphere VMs and we have already
>>> experienced unexpected job crashes caused by ESXi moving a heavily loaded
>>> VM to another (less loaded) physical machine. I wouldn't be surprised if
>>> swapping is also handled somehow differently.
>>> Looking at the Cloudera widgets, I see that the crash is usually preceded by
>>> an intense cpu_iowait period.
>>> I fear that Flink's unsafe access to memory could be a problem in those
>>> scenarios. Am I wrong?
>>>
>>> Any insight or debugging technique would be greatly appreciated.
>>> Best,
>>> Flavio
>>>
>>>
>>> On Wed, May 24, 2017 at 2:11 PM, Greg Hogan <c...@greghogan.com> wrote:
>>>
>>>> Hi Flavio,
>>>>
>>>> Flink handles interrupts so the only silent killer I am aware of is
>>>> Linux's OOM killer. Are you seeing such a message in dmesg?
>>>>
>>>> Greg
>>>>
>>>> On Wed, May 24, 2017 at 3:18 AM, Flavio Pompermaier <
>>>> pomperma...@okkam.it> wrote:
>>>>
>>>>> Hi to all,
>>>>> I'd like to know whether memory swapping could cause a taskmanager
>>>>> crash.
>>>>> In my cluster of virtual machines I'm seeing this strange behavior in
>>>>> my Flink cluster: sometimes, if memory gets swapped, the taskmanager (on
>>>>> that machine) dies unexpectedly without any log about the error.
>>>>>
>>>>> Is that possible or not?
>>>>>
>>>>> Best,
>>>>> Flavio
>>>>>
>>>>
>>>>
>>
