On Fri, Dec 7, 2018 at 12:43 PM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

>
> After a fresh JVM start the memory allocation looks roughly like this:
>
>              total       used       free     shared    buffers     cached
> Mem:           14G        14G       173M       1.1M        12M       3.2G
> -/+ buffers/cache:        11G       3.4G
> Swap:           0B         0B         0B
>
> Then, within a matter of days, the disk cache shrinks all the way down to
> unreasonably low values, e.g. only 150M.  At the same time "free" stays at
> its original level and "used" grows all the way up to 14G.  Shortly after
> that the node becomes unavailable because of IO pressure, and ultimately,
> after some time, the JVM gets killed.
>
> Most importantly, the resident size of the JVM process stays at around
> 11-12G the whole time, just as it was shortly after the start.  How can we
> find out where the rest of the memory gets allocated?  Is it just some sort
> of malloc fragmentation?
>
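
(To make the question concrete: the gap we are chasing is the system-wide
"used" figure minus the JVM's resident size.  A rough sketch for measuring it
on Linux, assuming the Cassandra PID is known, could look like the script
below; note that the RSS of other processes also lands in the difference.)

    #!/usr/bin/env python
    # Sketch: compare system-wide memory usage with the JVM's resident size.
    # Assumes Linux; the Cassandra PID is passed as the first argument.
    import sys

    def meminfo_kb():
        # Parse /proc/meminfo into a dict of kB values.
        out = {}
        with open('/proc/meminfo') as f:
            for line in f:
                key, rest = line.split(':', 1)
                out[key] = int(rest.split()[0])  # values are reported in kB
        return out

    def rss_kb(pid):
        # VmRSS in /proc/<pid>/status is the resident set size in kB.
        with open('/proc/%s/status' % pid) as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    return int(line.split()[1])
        return 0

    mem = meminfo_kb()
    used = mem['MemTotal'] - mem['MemFree'] - mem['Buffers'] - mem['Cached']
    jvm = rss_kb(sys.argv[1])
    print('used (excl. buffers/cache): %d MB' % (used // 1024))
    print('JVM RSS:                    %d MB' % (jvm // 1024))
    print('unaccounted for:            %d MB' % ((used - jvm) // 1024))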

For those following along at home, here's what we ended up with so far:

0. Switched to the next bigger EC2 instance type, r4.xlarge, and the
symptoms are gone.  Our bill is dominated by the price of EBS storage, so
this is much less than a 2x increase in total.

1. We've noticed that increased memory usage correlates with the number of
SSTables on disk.  When the number of files on disk decreases, available
memory increases.  This leads us to think that the extra memory allocation
is indeed due to the use of mmap.  It is not yet clear how we should account
for that; a rough way to check is sketched below.

2. Improved our monitoring to include the number of files (computed as total
minus free inodes).
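
Regarding the mmap accounting in (1): a rough way to check (a sketch only,
assuming Linux and a known Cassandra PID; "Data.db" is the usual suffix of
SSTable data files) is to split the process RSS into anonymous and
file-backed mappings by walking /proc/<pid>/smaps:

    #!/usr/bin/env python
    # Sketch: attribute the JVM's resident memory to anonymous vs. file-backed
    # (mmap'ed) regions.  Assumes Linux; pass the Cassandra PID as the first
    # argument.
    import sys

    anon_kb = file_kb = sstable_kb = 0
    path = None
    with open('/proc/%s/smaps' % sys.argv[1]) as f:
        for line in f:
            fields = line.split()
            if '-' in fields[0]:
                # Mapping header, e.g. "7f..-7f.. r--s 00000000 08:01 123 /path"
                path = ' '.join(fields[5:]) if len(fields) > 5 else None
            elif fields[0] == 'Rss:':
                kb = int(fields[1])
                if path and path.startswith('/'):
                    file_kb += kb
                    if path.endswith('Data.db'):
                        sstable_kb += kb
                else:
                    anon_kb += kb

    print('anonymous and other mappings: %d MB' % (anon_kb // 1024))
    print('file-backed mappings:         %d MB' % (file_kb // 1024))
    print('  of which SSTable Data.db:   %d MB' % (sstable_kb // 1024))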
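
And the inode-based count in (2) boils down to something like this (the data
directory path is an assumption, adjust to your mount point):

    #!/usr/bin/env python
    # Sketch: file count on the data volume = total inodes - free inodes.
    import os

    DATA_DIR = '/var/lib/cassandra/data'  # assumption: default data directory

    st = os.statvfs(DATA_DIR)
    used_inodes = st.f_files - st.f_ffree
    print('used inodes on the filesystem of %s: %d' % (DATA_DIR, used_inodes))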

Given the cluster's resource utilization, it still feels like r4.large
would be a good fit, if only we could figure out those few "missing" GB of
RAM. ;-)

Cheers!
--
Alex
