It is a bug. In some contexts, the kernel needs to be able to reclaim memory instantly, but this is not one of them. Here, the Java process is creating a new thread, and the kernel is allocating 16kB for its kernel stack; that is a regular allocation, not an atomic one. If you decode the gfp_mask value, you'll see that the kernel is allowed to initiate I/O and perform filesystem operations to satisfy the allocation, which it apparently did not do.

I do recommend reporting it; if it gets fixed, that will help others avoid the same problem.


On 02/06/2017 03:07 PM, Benjamin Roth wrote:
Thanks for the reply. We got rid of the OOMs by increasing vm.min_free_kbytes; its default of approx. 90 MB is maybe a bit low for systems with 128GB. I guess the OOM happens because the kernel could not reclaim enough paged memory instantly. I can't tell whether this is really a kernel bug or not. That was also my first thought, but in the end the main thing is that it works again with a higher min_free_kbytes.

2017-02-06 11:53 GMT+01:00 Avi Kivity <a...@scylladb.com>:


    On 01/26/2017 07:36 AM, Benjamin Roth wrote:
    Hi there,

    We recently installed 2 new nodes. They run on Ubuntu (Ubuntu
    16.04.1 LTS) with kernel 4.4.0-59-generic. On these nodes (and
    only on these) CS gets killed by the kernel due to OOM. This seems
    very strange to me because CS only takes roughly 20GB (out of
    128GB); most of the RAM is allocated to page cache.

    Top looks typically like this:
    KiB Mem : 13191691+total,  1974964 free, 20278184 used, 10966376+buff/cache
    KiB Swap:        0 total,        0 free,        0 used. 11051503+avail Mem

    This is what kern.log says:
    https://gist.github.com/brstgt/0f1aa6afb558a56d1cadce958db46cf9

    Has anyone encountered something like this before?


    2017-01-26T03:10:45.679458+00:00 cas10 kernel: [52226.449989] Node
    0 Normal: 33850*4kB (UMEH) 8*8kB (UMH) 1*16kB (H) 0*32kB 0*64kB
    0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 135480kB
    2017-01-26T03:10:45.679460+00:00 cas10 kernel: [52226.449995] Node
    1 Normal: 34213*4kB (UME) 176*8kB (UME) 0*16kB 0*32kB 0*64kB
    0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 138260kB


    There is plenty of free memory left (33850+34213)*4kB = 270 MB,
    but it is fragmented into 4k and 8k blocks, while the kernel is
    trying to allocate 16kB.  Still, the kernel could have evicted
    some page cache or swapped out anonymous memory.  You should
    report this to lkml, it is a kernel bug.
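The arithmetic above can be checked with a short script. This is only a sketch: the two input strings are the free-list lines from the quoted kern.log excerpt, rejoined onto single lines, while the helper names (`parse_free_blocks`, `summarize`) and the `needed_kb=16` default are illustrative assumptions, not part of any kernel tool.

```python
import re

# Free-list lines copied from the kern.log excerpt quoted above,
# rejoined so each node is one string.
LOG_LINES = [
    "Node 0 Normal: 33850*4kB (UMEH) 8*8kB (UMH) 1*16kB (H) 0*32kB 0*64kB "
    "0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 135480kB",
    "Node 1 Normal: 34213*4kB (UME) 176*8kB (UME) 0*16kB 0*32kB 0*64kB "
    "0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 138260kB",
]

# Matches entries such as "33850*4kB": a block count, then a block size in kB.
BLOCK_RE = re.compile(r"(\d+)\*(\d+)kB")


def parse_free_blocks(line):
    """Return {block_size_kb: free_block_count} for one free-list line."""
    return {int(size): int(count) for count, size in BLOCK_RE.findall(line)}


def summarize(line, needed_kb=16):
    """Return (total free kB, number of free blocks of at least needed_kb)."""
    blocks = parse_free_blocks(line)
    total_kb = sum(size * count for size, count in blocks.items())
    big_enough = sum(count for size, count in blocks.items() if size >= needed_kb)
    return total_kb, big_enough


for line in LOG_LINES:
    total_kb, big_enough = summarize(line)
    node = line.split(":")[0]
    print(f"{node}: {total_kb} kB free, {big_enough} free block(s) >= 16kB")
```

The per-node totals agree with the figures the kernel printed at the end of each line, and almost all of the free memory sits in 4kB and 8kB blocks, which is why a 16kB request can fail even with roughly 270 MB free.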



-- Benjamin Roth
    Prokurist

    Jaumo GmbH · www.jaumo.com
    Wehrstraße 46 · 73035 Göppingen · Germany
    Phone +49 7161 304880-6 · Fax +49 7161 304880-1
    AG Ulm · HRB 731058 · Managing Director: Jens Kammerer




