Hi Flavio,
can you post all the memory configuration parameters of your workers? Did you investigate whether the direct or the heap memory grew?

Thanks,
Fabian
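(For anyone following along: the parameters Fabian is asking about live in flink-conf.yaml. A sketch of the memory-related keys on a Flink 1.2/1.3-era setup -- the values below are illustrative placeholders, not Flavio's actual settings:

    # flink-conf.yaml -- memory-related keys (illustrative values only)
    taskmanager.heap.mb: 14336                 # TM JVM heap (-Xmx), here 14 GB
    taskmanager.memory.off-heap: false         # managed memory on the heap vs. in direct memory
    taskmanager.memory.fraction: 0.7           # share of free memory given to Flink's managed memory
    taskmanager.memory.preallocate: false      # allocate managed memory eagerly at startup?
    taskmanager.network.numberOfBuffers: 2048  # network buffers, one 32 KB memory segment each by default

Comparing these settings against the resident size the OS reports for the TM process is what distinguishes a heap problem from a direct-memory one.)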
2017-05-29 20:53 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:

> Hi to all,
> I'm still trying to understand what's going on in our production Flink cluster.
> The facts are:
>
> 1. The Flink cluster runs on 5 VMware VMs managed by ESXi.
> 2. For one specific job we have, if we don't limit the direct memory to 5 GB, the TM gets killed by the OS almost immediately because the memory required by the TM at some point becomes huge, like > 100 GB (other jobs seem to be less affected by the problem).
> 3. Although the memory consumption is much better with that limit, the Flink TM memory continuously grows job after job (of this problematic type): we set the TM max heap to 14 GB and the memory required by the JVM can reach ~30 GB. How is that possible?
>
> My fear is that there's some annoying memory leak / bad memory allocation at the Flink network level, but I have no evidence of this (except the fact that the VM which doesn't have an HDFS datanode underneath the Flink TM is the one with the biggest TM virtual memory consumption).
>
> Thanks for the help,
> Flavio
>
> On 29 May 2017 15:37, "Nico Kruber" <n...@data-artisans.com> wrote:
>
>> FYI: taskmanager.sh sets this parameter but also states the following:
>>
>> # Long.MAX_VALUE in TB: This is an upper bound, much less direct memory will be used
>> TM_MAX_OFFHEAP_SIZE="8388607T"
>>
>> Nico
>>
>> On Monday, 29 May 2017 15:19:47 CEST Aljoscha Krettek wrote:
>>
>>> Hi Flavio,
>>>
>>> Is this running on YARN or bare metal? Did you manage to find out where this insanely large parameter is coming from?
>>>
>>> Best,
>>> Aljoscha
>>>
>>> On 25. May 2017, at 19:36, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>
>>>> Hi to all,
>>>> I think we found the root cause of all the problems. Looking at dmesg, there was a "crazy" total-vm size associated with the OOM error, a LOT bigger than the TaskManager's available memory. In our case the TM had a max heap of 14 GB, while the dmesg error was reporting a required amount of memory in the order of 60 GB!
>>>>
>>>> [ 5331.992539] Out of memory: Kill process 24221 (java) score 937 or sacrifice child
>>>> [ 5331.992619] Killed process 24221 (java) total-vm:64800680kB, anon-rss:31387544kB, file-rss:6064kB, shmem-rss:0kB
>>>>
>>>> That shouldn't have been possible with an ordinary JVM (and our TM was running without off-heap settings), so we looked at the parameters used to run the TM JVM, and indeed there was a really huge amount of memory given to MaxDirectMemorySize. To my big surprise, Flink runs a TM with this parameter set to 8388607T (8,388,607 terabytes).. does that make any sense?? Is the importance of this parameter documented anywhere (and why is it used in non-off-heap mode as well)? Is it related to network buffers? It should also be documented that this parameter should be added to the TM heap when reserving memory for Flink (IMHO).
>>>>
>>>> I hope that these painful sessions of Flink troubleshooting will prove to be of added value sooner or later..
>>>>
>>>> Best,
>>>> Flavio
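(To make the 8388607T figure concrete: it comes straight from the launch script Nico quotes. A simplified sketch of the relevant lines of bin/taskmanager.sh from that era -- paraphrased, variable names approximate, not a verbatim copy:

    # bin/taskmanager.sh (simplified sketch, variable names approximate)
    # Long.MAX_VALUE in TB: This is an upper bound, much less direct memory will be used
    TM_MAX_OFFHEAP_SIZE="8388607T"

    # the bound is passed to the TM JVM even when off-heap mode is disabled:
    export JVM_ARGS="${JVM_ARGS} -Xms${TM_HEAP_SIZE}M -Xmx${TM_HEAP_SIZE}M \
        -XX:MaxDirectMemorySize=${TM_MAX_OFFHEAP_SIZE}"

Note that -XX:MaxDirectMemorySize is only a ceiling on NIO direct-buffer allocations, not a reservation: setting it absurdly high allocates nothing by itself, but it also means nothing stops direct allocations (network buffers, anything calling ByteBuffer.allocateDirect) from growing far beyond the configured heap, which is consistent with a 14 GB heap showing up as a ~60 GB total-vm in the OOM killer's log.)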
>>>> On Thu, May 25, 2017 at 10:21 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>>
>>>> I can confirm that after giving less memory to the Flink TM the job was able to run successfully.
>>>> After almost 2 weeks of pain, here is a summary of our experience with Flink in virtualized environments (such as VMware ESXi):
>>>>
>>>> 1. Disable the virtualization "feature" that transfers a VM from a (heavily loaded) physical machine to another one to balance resource consumption.
>>>> 2. Check dmesg when a TM dies without logging anything: usually it goes OOM and the OS kills it, and that's where you find the trace of it.
>>>> 3. CentOS 7 on ESXi seems to start swapping VERY early (in my case the OS starts swapping even with 12 out of 32 GB of memory free)! We're still investigating how this behavior could be fixed: the problem is that it's better not to disable swapping, because otherwise VMware could start ballooning (which is definitely worse...).
>>>>
>>>> I hope these tips can save someone else's day..
>>>>
>>>> Best,
>>>> Flavio
>>>>
>>>> On Wed, May 24, 2017 at 4:28 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>>
>>>> Hi Greg,
>>>> you were right! After typing dmesg I found "Out of memory: Kill process 13574 (java)". This is really strange because the JVM of the TM is very calm.
>>>> Moreover, there are 7 GB of memory available (out of 32), but somehow the OS decides to start swapping and, when it runs out of available swap memory, it kills the Flink TM :(
>>>>
>>>> Any idea of what's going on here?
>>>>
>>>> On Wed, May 24, 2017 at 2:32 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>>
>>>> Hi Greg,
>>>> I carefully monitored all TM memory with jstat -gcutil and there's no full GC, only young GCs.
>>>> The initial situation on the dying TM is:
>>>>
>>>>     S0     S1      E      O      M      CCS    YGC   YGCT    FGC  FGCT   GCT
>>>>     0.00   100.00  33.57  88.74  98.42  97.17   159   2.508    1  0.255   2.763
>>>>     0.00   100.00  90.14  88.80  98.67  97.17   197   2.617    1  0.255   2.873
>>>>     0.00   100.00  27.00  88.82  98.75  97.17   234   2.730    1  0.255   2.986
>>>>
>>>> After about 10 hours of processing it is:
>>>>
>>>>     S0     S1      E      O      M      CCS    YGC   YGCT    FGC  FGCT   GCT
>>>>     0.00   100.00  21.74  83.66  98.52  96.94  5519  33.011    1  0.255  33.267
>>>>     0.00   100.00  21.74  83.66  98.52  96.94  5519  33.011    1  0.255  33.267
>>>>     0.00   100.00  21.74  83.66  98.52  96.94  5519  33.011    1  0.255  33.267
>>>>
>>>> So I don't think that OOM could be an option.
>>>>
>>>> However, the cluster is running on ESXi vSphere VMs, and we have already experienced unexpected job crashes because of ESXi moving a heavily loaded VM to another (less loaded) physical machine.. I wouldn't be surprised if swapping is also handled somehow differently.. Looking at the Cloudera widgets I see that the crash is usually preceded by an intense cpu_iowait period. I fear that Flink's unsafe memory access could be a problem in those scenarios. Am I wrong?
>>>>
>>>> Any insight or debugging technique is greatly appreciated.
>>>>
>>>> Best,
>>>> Flavio
>>>>
>>>> On Wed, May 24, 2017 at 2:11 PM, Greg Hogan <c...@greghogan.com> wrote:
>>>>
>>>> Hi Flavio,
>>>>
>>>> Flink handles interrupts so the only silent killer I am aware of is Linux's OOM killer. Are you seeing such a message in dmesg?
>>>>
>>>> Greg
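(Greg's dmesg check can be done with standard Linux commands; a generic sketch, nothing Flink-specific:

    # Did the kernel's OOM killer take down the TaskManager JVM?
    dmesg | grep -iE "out of memory|killed process"

    # on systemd-based systems such as CentOS 7, the kernel log is also in the journal:
    journalctl -k | grep -i "killed process"

If the TM was the victim, you will see a "Killed process ... total-vm:..." line such as the one Flavio quotes above in his May 25 mail.)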
>>>> On Wed, May 24, 2017 at 3:18 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>>
>>>> Hi to all,
>>>> I'd like to know whether memory swapping could cause a taskmanager crash.
>>>> In my Flink cluster of virtual machines I'm seeing this strange behavior: sometimes, if memory gets swapped, the taskmanager (on that machine) dies unexpectedly without any log about the error.
>>>>
>>>> Is that possible or not?
>>>>
>>>> Best,
>>>> Flavio
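(A footnote on the early-swapping behavior described in Flavio's May 25 summary: a mitigation commonly suggested for this situation -- not something tested in this thread -- is to lower vm.swappiness instead of disabling swap outright, since disabling swap can push VMware into ballooning:

    # check the kernel's tendency to swap (kernel default: 60)
    sysctl vm.swappiness

    # swap only under real memory pressure
    sudo sysctl -w vm.swappiness=1

    # persist the setting across reboots
    echo 'vm.swappiness=1' | sudo tee /etc/sysctl.d/99-swappiness.conf

This addresses the symptom only; the root cause found above, the effectively unbounded MaxDirectMemorySize, still needs to be accounted for when sizing the TM against physical memory.)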