Hi Kasper,

I don't know whether it will chew up a disproportionate amount of memory this
time, but it will do so the next time. So, I recommend connecting a
large-enough SSD now and adding swap on it as soon as you see the excessive
memory consumption.
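
For example, a minimal sketch of what I mean, assuming the new SSD ends up
mounted at /mnt/ssd (a placeholder path):

    fallocate -l 1T /mnt/ssd/mds-swap    # reserve a 1 TiB file (use dd instead if fallocate is unsuitable for swap on your filesystem)
    chmod 600 /mnt/ssd/mds-swap
    mkswap /mnt/ssd/mds-swap
    swapon /mnt/ssd/mds-swap
    swapon --show                        # verify that the swap is active

A dedicated swap partition on the SSD works just as well; the point is simply
to have the space available before the MDS starts ballooning.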

Regarding a tracker ticket: yes, several exist. I found
https://tracker.ceph.com/issues/71167,
https://tracker.ceph.com/issues/71136, and
https://tracker.ceph.com/issues/52715, but I am sure more such tickets
exist, and there is no single root cause.

On Wed, Jun 18, 2025 at 3:41 PM Kasper Rasmussen <
kasper_steenga...@hotmail.com> wrote:

> Hi Alexander
>
> Thanks man
>
> Forgot to mention ceph version is 18.2.7
>
> Is this described anywhere - bug tracker / docs?
>
> Also when you write: "The same huge-swap recommendation applies to the
> recovery operations."
>
> Should I - if I fail over the MDS in its current state - expect that it will
> chew through huge amounts of RAM, requiring me to add 1 TB of swap?
>
> BR. Kasper
>
>
> ------------------------------
> *From:* Alexander Patrakov <patra...@gmail.com>
> *Sent:* Wednesday, June 18, 2025 09:11
> *To:* Kasper Rasmussen <kasper_steenga...@hotmail.com>
> *Cc:* ceph-users <ceph-users@ceph.io>
> *Subject:* Re: [ceph-users] CephFS scrub resulting in MDS_CACHE_OVERSIZED
>
> Hello Kasper,
>
> This is known. Next time, please add at least 1 TB of swap before the
> scrub, and ignore the warning while the MDS is chewing through all the
> directories and files.
>
> The same huge-swap recommendation applies to the recovery operations.
>
> On Wed, Jun 18, 2025 at 3:01 PM Kasper Rasmussen <
> kasper_steenga...@hotmail.com> wrote:
>
> After starting a recursive scrub on a CephFS file system with a lot of files,
> the MDS cache went oversized.
>
> Scrub command: ceph... scrub start / recursive,repair,force
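>
> (The target was abbreviated above. For reference, the general form of this
> command on Reef is roughly:
>
>     ceph tell mds.<fs_name>:0 scrub start / recursive,repair,force    # <fs_name> is a placeholder
>
> where <fs_name> stands in for the actual file system name.)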
>
> I kept an eye on the MDS memory usage - since I was warned that it might
> go crazy - and after 2-3 hours I started getting the warning:
>
> [WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
>     mds.generic-mds.<host>.asddje(mds.0): MDS cache is too large
> (63GB/36GB); 1250394 inodes in use by clients, 28888 stray files
>
>
> I then paused the scrub, resulting in scrub status
>
>
> {
>     "status": "PAUSED (22837086 inodes in the stack)",
>     "scrubs": {
>         "27f0e32a-bc8c-443d-b1f0-534474798ddf": {
>             "path": "/",
>             "tag": "27f0e32a-bc8c-443d-b1f0-534474798ddf",
>             "options": "recursive,repair,force"
>         }
>     }
> }
>
> and expected the cache size to go down again - but it didn't.
> After 12+ hours with no change, I opted to abort the scrub - again
> expecting that the inodes in the stack would be offloaded from memory.
>
> The status after abort command:
>
> {
>     "status": "PAUSED (0 inodes in the stack)",
>     "scrubs": {}
> }
>
> But still no changes to the cache size.
>
> Since the status after the abort command had "PAUSED" in it, I resumed the
> scrub, resulting in status:
>
> {
>     "status": "no active scrubs running",
>     "scrubs": {}
> }
>
> Still no changes to the cache size.
>
> The log from the MDS in standard log level was:
>
> debug 2025-06-03T06:48:24.122+0000 7f319065d640 1
> mds.generic-mds.<host>.asddje asok_command: scrub start
> {path=/,prefix=scrub start,scrubops=[recursive,repair,force]} (starting...)
> debug 2025-06-03T06:48:24.122+0000 7f318864d640 0 log_channel(cluster) log
> [INF] : scrub queued for path: /
> debug 2025-06-03T06:48:24.122+0000 7f318864d640 0 log_channel(cluster) log
> [INF] : scrub summary: idle+waiting paths [/]
> debug 2025-06-03T06:48:24.122+0000 7f318864d640 0 log_channel(cluster) log
> [INF] : scrub summary: active paths [/]
> debug 2025-06-03T06:48:24.126+0000 7f3189e50640 1
> mds.0.cache.dir(0x10041e16a55) mismatch between head items and
> fnode.fragstat! printing dentries
> debug 2025-06-03T06:48:24.126+0000 7f3189e50640 1
> mds.0.cache.dir(0x10041e16a55) get_num_head_items() = 38;
> fnode.fragstat.nfiles=28 fnode.fragstat.nsubdirs=11
> debug 2025-06-03T06:48:24.126+0000 7f3189e50640 1
> mds.0.cache.dir(0x10041e16a55) mismatch between child accounted_rstats and
> my rstats!
> debug 2025-06-03T06:48:24.126+0000 7f3189e50640 1
> mds.0.cache.dir(0x10041e16a55) total of child dentries: n(v0
> rc2025-06-03T06:48:11.042059+0000 b1661845634 127=95+32)
> debug 2025-06-03T06:48:24.126+0000 7f3189e50640 1
> mds.0.cache.dir(0x10041e16a55) my rstats: n(v544237
> rc2025-06-03T06:48:11.042059+0000 b1661845650 128=96+32)
> debug 2025-06-03T06:49:38.689+0000 7f319065d640 1
> mds.generic-mds.<host>.asddje asok_command: scrub status {prefix=scrub
> status} (starting...)
> debug 2025-06-03T06:51:49.782+0000 7f319065d640 1
> mds.generic-mds.<host>.asddje asok_command: scrub status {prefix=scrub
> status} (starting...)
> debug 2025-06-03T06:55:39.654+0000 7f319065d640 1
> mds.generic-mds.<host>.asddje asok_command: scrub status {prefix=scrub
> status} (starting...)
> debug 2025-06-03T07:00:56.205+0000 7f319065d640 1
> mds.generic-mds.<host>.asddje asok_command: scrub status
> ..
> ..
> From here on, the log only shows either
> - asok_command: scrub status {prefix=scrub status} (starting...)
> - Updating MDS map to version xxxxxx from mon.3
> until I paused the scrub.
>
> Extracts from the perf dump from the MDS:
>
> "mds": {
> ..
> ..
> ..
> "inodes": 23121955,
> "inodes_top": 3684,
> "inodes_bottom": 1728,
> "inodes_pin_tail": 23116543,
> "inodes_pinned": 23116691,
> "inodes_expired": 39049803601,
> "inodes_with_caps": 84593,
> ..
> ..
>
> }
> ..
> ..
> "mds_mem": {
>      "ino": 23114378,
>      "ino+": 38966647328,
>      "ino-": 38943532950,
>      "dir": 513065,
>      "dir+": 130921896,
>      "dir-": 130408831,
>      "dn": 23121954,
>      "dn+": 39349549680,
>      "dn-": 39326427726,
>      "cap": 87280,
>      "cap+": 6964477825,
>      "cap-": 6964390545,
>      "rss": 79730620,
>      "heap": 223508
> },
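>
> (For reference, counters like the ones above can be pulled with something like:
>
>     ceph tell mds.<mds_name> perf dump      # <mds_name> is a placeholder
>     ceph daemon mds.<mds_name> perf dump    # alternative, run on the MDS host via the admin socket
>
> where <mds_name> stands in for the actual daemon name,
> e.g. generic-mds.<host>.asddje.)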
>
>
> I have been reluctant to just fail the MDS to clear the memory, but when
> I finally came around to doing so, I got this error:
>
> "Error EPERM: MDS has one of two health warnings which could extend
> recovery: MDS_TRIM or MDS_CACHE_OVERSIZED. MDS failover is not recommended
> since it might cause unexpected file system unavailability. If you wish to
> proceed, pass --yes-i-really-mean-it"
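>
> For reference, overriding that guard would look something like the following,
> with <mds_name> as a placeholder for the actual daemon name:
>
>     ceph mds fail <mds_name> --yes-i-really-mean-it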
>
> At this moment the number of strays reported in the MDS_CACHE_OVERSIZED
> warning is up by a factor of 10 (approx. 280000).
>
> Which made me pause. This seems like a bug... but to be honest, I don't quite
> know what to expect if I just execute it with "--yes-i-really-mean-it".
> Will the MDS eat a huge amount of RAM during replay? (I've seen this before
> during a failover, where the MDS ate almost 200 GB of RAM even though the
> cache was not oversized.)
> Any advice on how to proceed?
>
> BR. Kasper
>
>
>
> --
> Alexander Patrakov
>


-- 
Alexander Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
