Hi Frank,
Are you able to share any logs from the MDS that's crashing? And just to
confirm, the rank goes into up:active before it eventually OOMs?
This sounds familiar-ish, but I'm also recovering after a nearly 24-hour
bender on another Ceph-related recovery... trying to rack my brain for
similar issues we've seen.
Is there much swap space available on the node as well? In the event the
daemon is actually making progress but just lacks resources, you may need
to extend the time it can remain up by adding swap.
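As a rough sketch, swap headroom on the node can be read from
/proc/meminfo. The helper name swap_summary is made up for illustration,
and it assumes a Linux host where /proc/meminfo reports values in kB.

```python
def swap_summary(meminfo_text):
    """Return (swap_total_kb, swap_free_kb) parsed from /proc/meminfo text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Key: value"
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    # Missing keys are treated as "no swap configured".
    return values.get("SwapTotal", 0), values.get("SwapFree", 0)


# Usage on the MDS host (assumption: Linux):
#   with open("/proc/meminfo") as f:
#       total_kb, free_kb = swap_summary(f.read())
#   print(f"swap total {total_kb / 1024**2:.1f} GiB, free {free_kb / 1024**2:.1f} GiB")
```

Comparing the free figure against how fast the MDS grows before it is
killed gives a rough idea of how much extra run time swap might buy.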
Bailey Allison
Service Team Lead
45Drives, Ltd.
866-594-7199 x868
On 1/10/25 13:30, Frank Schilder wrote:
Hi all,
we seem to have a serious issue with our file system. The Ceph version is
latest Pacific. After a large cleanup operation we had an MDS rank with 100
million stray entries (yes, one hundred million). Today we restarted this
daemon, which cleans up the stray entries. It seems that this leads to a
restart loop due to OOM. The rank becomes active and then starts pulling in
DNS and INOS entries until all memory is exhausted.
I have no idea if there is at least progress removing the stray items or if it
starts from scratch every time. If it needs to pull as many DNS/INOS into cache
as there are stray items, we don't have a server at hand with enough RAM.
Q1: Is the MDS at least making progress in every restart iteration?
Q2: If not, how do we get this rank up again?
Q3: If we can't get this rank up soon, can we at least move directories away
from this rank by pinning them to another rank?
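For reference on Q3, a minimal sketch of what pinning looks like: CephFS
export pins are set through the ceph.dir.pin extended attribute on a
directory, the same thing `setfattr -n ceph.dir.pin -v <rank> <dir>` does.
The function name and the injectable setxattr parameter are hypothetical,
added only so the sketch can be tried without a CephFS mount. It does not
address whether a subtree can actually be exported while the rank that
currently owns it keeps restarting.

```python
import os


def pin_directory(path, rank, setxattr=None):
    """Pin `path` to MDS `rank` via ceph.dir.pin; rank -1 removes the pin."""
    if rank < -1:
        raise ValueError("rank must be -1 (unpin) or a non-negative MDS rank")
    value = str(rank).encode("ascii")
    # os.setxattr is Linux-only; it is looked up lazily so the function can
    # be exercised elsewhere with a stand-in setter.
    setter = setxattr if setxattr is not None else os.setxattr
    setter(path, "ceph.dir.pin", value)
    return value


# Usage on a client with the file system mounted (path is hypothetical):
#   pin_directory("/mnt/cephfs/some/dir", 1)
```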
Currently, the rank in question reports .mds_cache.num_strays=0 in perf dump.
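One way to approach Q1 would be to record this counter on every restart
iteration and compare the readings. The sketch below assumes the JSON comes
from something like `ceph daemon mds.<name> perf dump` and that the counter
sits at mds_cache.num_strays as quoted above. The function names are
illustrative only, and a reading of 0 right after startup may not yet
reflect what is on disk.

```python
import json


def read_counter(perf_dump_json, section="mds_cache", counter="num_strays"):
    """Extract one counter value from the JSON text of an MDS perf dump."""
    return json.loads(perf_dump_json)[section][counter]


def deltas(readings):
    """Differences between consecutive readings, given in time order.

    Negative values mean the counter went down between two samples.
    """
    return [later - earlier for earlier, later in zip(readings, readings[1:])]


# Usage with saved dumps from successive restarts (file names hypothetical):
#   readings = [read_counter(open(p).read()) for p in ("run1.json", "run2.json")]
#   print(deltas(readings))
```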
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io