This is likely caused by http://tracker.ceph.com/issues/37399.
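
Until that is fixed, one possible interim mitigation (my assumption, not
something the tracker confirms) is to run without standby-replay, since the
runaway memory shows up on the daemons replaying the journal. On Luminous
that would mean something like the following on each MDS host, using mds.e
as an example daemon name:

    # ceph.conf, [mds] or per-daemon [mds.e] section
    mds standby replay = false

    $ systemctl restart ceph-mds@e    # restart so the setting takes effect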

Regards
Yan, Zheng



On Sat, Jan 5, 2019 at 5:44 PM Matthias Aebi <ma...@dizmo.com> wrote:
>
> Hello everyone,
>
> We are running a small cluster on 5 machines with 48 OSDs / 5 MDSs / 5 MONs,
> based on Luminous 12.2.10 and Debian Stretch 9.6. With a single-MDS
> configuration everything works fine, and the active MDS uses ~1 GByte of
> memory for cache, as configured (see the config sketch after the heap
> output below):
>
> $ watch ceph tell mds.$(hostname) heap stats
>
> mds.e tcmalloc heap stats:------------------------------------------------
> MALLOC:     1172867096 ( 1118.5 MiB) Bytes in use by application
> MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
> MALLOC: +     39289912 (   37.5 MiB) Bytes in central cache freelist
> MALLOC: +     17245344 (   16.4 MiB) Bytes in transfer cache freelist
> MALLOC: +     34303760 (   32.7 MiB) Bytes in thread cache freelists
> MALLOC: +      5796032 (    5.5 MiB) Bytes in malloc metadata
> MALLOC:   ------------
> MALLOC: =   1269502144 ( 1210.7 MiB) Actual memory used (physical + swap)
> MALLOC: +     19775488 (   18.9 MiB) Bytes released to OS (aka unmapped)
> MALLOC:   ------------
> MALLOC: =   1289277632 ( 1229.6 MiB) Virtual address space used
> MALLOC:
> MALLOC:          70430              Spans in use
> MALLOC:             17              Thread heaps in use
> MALLOC:           8192              Tcmalloc page size
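>
> For reference, the ~1 GByte cache limit mentioned above would presumably
> come from a setting along these lines (the exact value here is
> illustrative, not copied from our ceph.conf):
>
>     [mds]
>         # Luminous limits MDS cache by bytes; 1 GiB matches the figure above
>         mds cache memory limit = 1073741824
>
>     # or at runtime, without restarting the daemon:
>     $ ceph tell mds.e injectargs '--mds_cache_memory_limit=1073741824'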
> -------------
> $ ceph versions
>
> {
>     "mon": {
>         "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 5
>     },
>     "mgr": {
>         "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 3
>     },
>     "osd": {
>         "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 48
>     },
>     "mds": {
>         "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 5
>     },
>     "overall": {
>         "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 61
>     }
> }
>
> -------------
> $ ceph -s
>
> cluster:
>  id:     .... c9024
>  health: HEALTH_OK
>
> services:
>  mon: 5 daemons, quorum a,b,c,d,e
>  mgr: libra(active), standbys: b, a
>  mds: cephfs-1/1/1 up  {0=e=up:active}, 1 up:standby-replay, 3 up:standby
>  osd: 48 osds: 48 up, 48 in
>
> data:
>  pools:   2 pools, 2052 pgs
>  objects: 44.44M objects, 52.3TiB
>  usage:   107TiB used, 108TiB / 216TiB avail
>  pgs:     2051 active+clean
>           1    active+clean+scrubbing+deep
>
> io:
>  client:   85.3KiB/s rd, 3.17MiB/s wr, 45op/s rd, 26op/s wr
> -------------
>
> However, as soon as we use "ceph fs set cephfs max_mds 2" to add a second
> MDS, things get out of hand within seconds, although in a rather unexpected
> way: the standby MDS that is brought in as the second active works fine and
> shows normal memory consumption. The two machines that start replaying the
> journal in order to become standby servers, however, immediately accumulate
> dozens of GBytes of memory, climbing to about 150 GByte and almost
> immediately spilling into swap, which drives the load up to about 80 within
> seconds and makes all other processes on those hosts (mainly OSDs)
> unresponsive.
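>
> For completeness, that step boils down to essentially the following (the
> watch command simply reuses the local MDS name, as above):
>
> $ ceph fs set cephfs max_mds 2
> $ ceph mds stat                                # rank 1 comes up, standbys start replaying
> $ watch ceph tell mds.$(hostname) heap stats   # per-daemon memory, output below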
>
> As the machine becomes practically unreachable when this happens, it is only
> possible to get memory statistics right when things start to go wrong. After
> that it is no longer possible to get a memory dump, as the OS as a whole is
> blocked by swapping.
>
> $ watch ceph tell mds.$(hostname) heap stats
>
> mds.a tcmalloc heap stats:------------------------------------------------
> MALLOC:    36113137024 (34440.2 MiB) Bytes in use by application
> MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
> MALLOC: +      7723144 (    7.4 MiB) Bytes in central cache freelist
> MALLOC: +      2523264 (    2.4 MiB) Bytes in transfer cache freelist
> MALLOC: +      2460024 (    2.3 MiB) Bytes in thread cache freelists
> MALLOC: +     41185472 (   39.3 MiB) Bytes in malloc metadata
> MALLOC:   ------------
> MALLOC: =  36167028928 (34491.6 MiB) Actual memory used (physical + swap)
> MALLOC: +      1417216 (    1.4 MiB) Bytes released to OS (aka unmapped)
> MALLOC:   ------------
> MALLOC: =  36168446144 (34492.9 MiB) Virtual address space used
> MALLOC:
> MALLOC:          38476              Spans in use
> MALLOC:             13              Thread heaps in use
> MALLOC:           8192              Tcmalloc page size
> -------------
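>
> In case it helps with debugging, this is roughly what we run to capture
> heap data while the daemon is still responsive (assuming the MDS is linked
> against tcmalloc, which the stats above suggest); mds.a is the affected
> daemon here:
>
> $ ceph tell mds.a heap start_profiler
> $ ceph tell mds.a heap dump            # profile files end up next to the daemon's log
> $ ceph tell mds.a heap stop_profiler
> $ ceph tell mds.a heap release         # ask tcmalloc to return freed pages to the OS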
>
> Please also find attached the zipped log file of one of the two new standby
> MDSs while it is trying to replay the filesystem journal.
>
> As soon as the number of MDSs is set back to 1 (using "ceph fs set cephfs
> max_mds 1" and "ceph mds deactivate 1") things calm down and the cluster
> returns to normal. Is this a known problem with Luminous, and what can be
> done to work around it so that the multi-MDS feature can be used?
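>
> (i.e. the recovery sequence is just these two commands; note that on
> Luminous the extra rank has to be deactivated by hand, it is not stopped
> automatically when max_mds is lowered:)
>
> $ ceph fs set cephfs max_mds 1
> $ ceph mds deactivate 1    # stop rank 1 explicitly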
>
> As all servers here run Debian, it is unfortunately not possible to upgrade
> to Mimic, since it apparently cannot / will not be made available for Debian
> Stretch due to the toolchain issue described elsewhere.
>
> Thank you for any help and pointers in the right direction!
>
> Best,
> Matthias
>
> ----------------------------------------------------------------------------------------------------
> dizmo - The Interface of Things
> http://www.dizmo.com, Phone +41 52 267 88 50, Twitter @dizmos
> dizmo inc, Universitätsstrasse 53, CH-8006 Zurich, Switzerland
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
