On Wed, May 27, 2020 at 10:09 PM Dylan McCulloch <d...@unimelb.edu.au> wrote:
>
> Hi all,
>
> The single active MDS on one of our Ceph clusters is close to running out of
> RAM.
>
> MDS total system RAM = 528GB
> MDS current free system RAM = 4GB
> mds_cache_memory_limit = 451GB
> current mds cache usage = 426GB

This mds_cache_memory_limit is way too high for the available RAM. We
normally recommend that your RAM be at least 150% of your cache limit, but we
lack data for cache sizes this large.

> Presumably we need to reduce our mds_cache_memory_limit and/or
> mds_max_caps_per_client, but would like some guidance on whether it's
> possible to do that safely on a live production cluster when the MDS is
> already pretty close to running out of RAM.
>
> Cluster is Luminous - 12.2.12
> Running single active MDS with two standby.
> 890 clients
> Mix of kernel client (4.19.86) and ceph-fuse.
> Clients are 12.2.12 (398) and 12.2.13 (3)

v12.2.12 has the changes necessary to throttle MDS cache size reduction. You
should be able to reduce mds_cache_memory_limit to any lower value without
destabilizing the cluster (see the second sketch below).

> The kernel clients have stayed under "mds_max_caps_per_client": "1048576".
> But the ceph-fuse clients appear to hold very large numbers according to the
> ceph-fuse asok.
> e.g.
> "num_caps": 1007144398,
> "num_caps": 1150184586,
> "num_caps": 1502231153,
> "num_caps": 1714655840,
> "num_caps": 2022826512,

That figure from the ceph-fuse asok is actually the cumulative number of caps
ever received, not the current count. I've created a ticket for this:
https://tracker.ceph.com/issues/45749

Look at the data from `ceph tell mds.foo session ls` instead.
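For example, something like this shows the sessions holding the most caps
(just a sketch: `mds.foo` stands in for your active MDS name, jq is assumed
to be available, and the "id"/"num_caps" field names are what I recall from
the session dump, so double-check against your own output):

    # Top 10 cap holders according to the MDS session list.
    ceph tell mds.foo session ls | \
        jq -r 'sort_by(.num_caps) | reverse | .[:10][] | "\(.id) \(.num_caps)"'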
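And for stepping the cache limit down on the live MDS, a rough sketch of what
I'd try (again `mds.foo` is a stand-in, the intermediate target is not a
tested recipe, and the final value is just the 150% guideline applied to your
528GB of RAM, i.e. 528 / 1.5 = 352GB):

    # mds_cache_memory_limit is in bytes; 352GiB = 352 * 1024^3.
    # Lower it in stages and watch the MDS RSS between steps.
    ceph tell mds.foo injectargs '--mds_cache_memory_limit=429496729600'  # 400GiB
    ceph tell mds.foo injectargs '--mds_cache_memory_limit=377957122048'  # 352GiB

    # Verify the running value (on the MDS host, via the admin socket):
    ceph daemon mds.foo config get mds_cache_memory_limit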
> Dropping caches on the clients appears to reduce their cap usage but does not
> free up RAM on the MDS.

The MDS won't free up RAM until the cache memory limit is reached.

> What is the safest method to free cache and reduce RAM usage on the MDS in
> this situation (without having to evict or remount clients)?

Reduce mds_cache_memory_limit.

> I'm concerned that reducing mds_cache_memory_limit even in very small
> increments may trigger a large recall of caps and overwhelm the MDS.

That used to be the case in older versions of Luminous, but not any longer.

--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D