BACK ONLINE! Thanks to all for helping on this one. Posting this in case others experience this issue.
With some help, the issue was fixed with the following steps (in this setup I only had one MDS for the FS):

cephfs-journal-tool --cluster <mycluster> --rank=<mycephfs>:0 journal inspect

All OK.

The journal backup failed no matter what I tried - like this (and generally switching the params around):

cephfs-journal-tool --cluster <mycluster> --rank=<mycephfs>:0 journal export backup.bin
cephfs-journal-tool --cluster <mycluster> --rank=<mycephfs>:all journal export backup.bin

It failed with "Error ((22) Invalid argument)" no matter what I tried - on my test env. it worked fine.

There seemed to be no other option than to recover dentries and reset the journal:

cephfs-journal-tool --cluster <mycluster> --rank=<mycephfs>:0 event recover_dentries summary
cephfs-journal-tool --cluster <mycluster> --rank=<mycephfs>:0 journal reset

All OK, though it took approx. 30 mins to complete the recover_dentries action.

The MDS was then restarted and the FS went active! HURRAY!

The rest here is somewhat from memory, so take it with a grain of salt - some scrubs to identify and fix issues:

ceph --cluster <mycluster> tell mds.<mdsname> scrub start /
ceph --cluster <mycluster> tell mds.<mdsname> scrub start /problemdir/problemfile
ceph --cluster <mycluster> tell mds.<mdsname> scrub start /problemdir/problemfile repair
ceph --cluster <mycluster> tell mds.<mdsname> scrub start ~mdsdir
ceph --cluster <mycluster> tell mds.<mdsname> scrub start ~mdsdir recursive
ceph --cluster <mycluster> tell mds.<mdsname> scrub start ~mdsdir recursive,repair

plus an MDS restart or two.

To avoid the issue related to bug 61009 in the future, the param mds_client_delegate_inos_pct was set to 0 (as proposed by Enrico, see the sketch below) and all my MDSs were restarted to pick up the setting.
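For reference, a minimal sketch of how the workaround can be applied. This assumes a cephadm-managed Reef cluster and kernel client mounts; the service name, mount point, and client name are placeholders, not details taken from the thread:

# Stop delegating preallocated inode ranges to clients (workaround for tracker 61009);
# restart the MDS daemons afterwards so the setting takes effect.
ceph config set mds mds_client_delegate_inos_pct 0
ceph orch restart mds.<fs_name>

# And/or force synchronous namespace operations on the kernel clients
# (i.e. disable async dirops) by (re)mounting with the wsync option:
mount -t ceph <mon_host>:/ /mnt/cephfs -o name=<client_id>,wsync

As Enrico notes below, it has not been entirely clear whether one of the two settings alone is sufficient, so applying both is the cautious choice.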
________________________________
From: Enrico Bocchi <enrico.boc...@cern.ch>
Sent: Thursday, May 22, 2025 12:03
To: ceph-users@ceph.io <ceph-users@ceph.io>
Subject: [ceph-users] Re: MDS Repeatedly Crashing/Restarting - Unable to get CephFS Active

Hi Kasper,

There has been a report on the mailing list about a successful recovery from this bug:
https://www.spinics.net/lists/ceph-users/msg83077.html

61009 is indeed the tracker where this bug is being discussed/triaged. PR53752 was an attempt at a fix, which turned out not to be necessary: see https://github.com/ceph/ceph/pull/53752#issuecomment-1963441022 and https://tracker.ceph.com/issues/61009#note-34

To avoid hitting this in the future, you should set `wsync` as a mount option for your kernel clients and/or set `mds_client_delegate_inos_pct` to 0 (from experience, it requires an MDS restart, though I have never tried on 18.2.7). "And/or" is because it has never been clear to me whether setting either of the two is sufficient to reach the goal.

Cheers,
Enrico

On 5/22/25 09:31, Kasper Rasmussen wrote:
> Ok, so I raised the log level for the following -
> debug_mds_log 5
> debug_mds 5
> debug_ms 5
> debug_monc 5
>
> Still, from what I can see, it starts replaying the mds log
>
> -2547> 2025-05-21T07:40:02.037+0000 7fa62b1f6640 2 mds.0.4402858 Booting: 2: replaying mds log
> -2546> 2025-05-21T07:40:02.037+0000 7fa62b1f6640 2 mds.0.4402858 Booting: 2: waiting for purge queue recovered
>
> and during that it crashes with the error as described in the META file
>
> "assert_thread_name": "mds-log-replay",
> "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.7/rpm/el9/BUILD/ceph-18.2.7/src/include/interval_set.h: In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]' thread 7fe9ab7d7640 time 2025-05-19T13:28:28.390758+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.7/rpm/el9/BUILD/ceph-18.2.7/src/include/interval_set.h: 568: FAILED ceph_assert(p->first <= start)\n",
>
> I'm not able to spot any "smoking gun" in what happens up till then.
>
> My suspicion pointed towards bug tracker https://tracker.ceph.com/issues/61009 since the error seems identical, though I just spotted that the thread name is slightly different -
>
> Bug tracker: "assert_thread_name": "md_log_replay"
> My issue: "assert_thread_name": "mds-log-replay"
>
> Might be a typo, or the name just changed between Quincy and Reef?
>
> It led me to the pull request https://github.com/ceph/ceph/pull/53752 mentioned in the 61009 bug tracker.
>
> The developer has been very polite and helpful so far, but I think it would be better to bring it in here, so others might find help in the future.
>
> Anyway - from my understanding there are two bugs in play, 61009 and 66251.
>
> There is a workaround described in 61009 - "Just FYI - workaround for this crash is to disable async dirops in the (kernel) client (mount option `nowsync`)."
> My understanding is that this is to avoid the issue described in 66251 in the first place - can anyone confirm or deny this?
> In 66251 there is a proposed WA - but my understanding is that we will have to wait for that WA to get implemented in code and released? - Can anyone confirm or deny this?
>
> If I'm correct, it seems like there is no fix as of now.
> But maybe anyone can advise if a disaster recovery should/would help as a work-around for the MDS crash during mds-log-replay? I think the following steps as proposed by Eugen earlier (and as stated in https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts) -
>
> Take the FS down:
> ceph fs fail <mycephfs>
>
> Backup of the journal, just in case:
> cephfs-journal-tool --cluster <mycluster> --rank=<mycephfs>:all journal export backup.bin
>
> Disaster recovery steps -
> cephfs-journal-tool --cluster <mycluster> --rank=<mycephfs>:all event recover_dentries summary
> cephfs-journal-tool --cluster <mycluster> --rank=<mycephfs>:all journal reset --yes-i-really-really-mean-it
> cephfs-table-tool --cluster <mycluster> --rank=<mycephfs>:all reset session
>
> And then finally bring the FS back up.
>
> And lastly, a conclusion in regards to my understanding of the WA on 61009 is important in order to avoid this issue in the future.
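On the last step above, bringing the FS back up after `ceph fs fail`: the documented counterpart is to mark the file system joinable again and let a standby MDS take over the rank. A minimal sketch, reusing the placeholders from the thread; these are the standard commands, not quoted from it:

# Allow MDS daemons to join the file system again after `ceph fs fail`
ceph --cluster <mycluster> fs set <mycephfs> joinable true

# Watch the rank move through replay/reconnect/rejoin to active
ceph --cluster <mycluster> fs status <mycephfs>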
>
> ________________________________
> From: Eugen Block <ebl...@nde.ag>
> Sent: Wednesday, May 21, 2025 08:47
> To: Kasper Rasmussen <kasper_steenga...@hotmail.com>
> Cc: Alexander Patrakov <patra...@gmail.com>; ceph-users@ceph.io <ceph-users@ceph.io>
> Subject: Re: [ceph-users] Re: MDS Repeatedly Crashing/Restarting - Unable to get CephFS Active
>
> You could increase the debug level and see if you get more information, but be aware that MDS debug fills up the filesystem incredibly fast. You could also add the dev mailing list as a recipient, if you're subscribed. And then there's a cephfs slack channel.
>
> Zitat von Kasper Rasmussen <kasper_steenga...@hotmail.com>:
>
>> Hi Eugen
>>
>> Thanks a lot
>>
>> However, I'm a bit concerned about the Disaster Recovery.
>>
>> https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts
>>
>> In the pre-warnings for the DR it states - "It is essential to understand exactly what has gone wrong with your file system before attempting to repair it."
>>
>> Currently I have NO idea what has gone wrong. And the error message seems to leave me none the wiser.
>> I'm not in the mood for taking chances with this one ;-)
>>
>> Any idea on how to get the attention of the devs?
>>
>> BR. Kasper
>>
>>
>> ________________________________
>> From: Eugen Block <ebl...@nde.ag>
>> Sent: Tuesday, May 20, 2025 15:51
>> To: Kasper Rasmussen <kasper_steenga...@hotmail.com>
>> Cc: Alexander Patrakov <patra...@gmail.com>; ceph-users@ceph.io <ceph-users@ceph.io>
>> Subject: Re: [ceph-users] Re: MDS Repeatedly Crashing/Restarting - Unable to get CephFS Active
>>
>> In that case I would back up both journals, just in case. But I can't really comment on the tracker you found, I hope one of the devs can chime in if there's a quick fix or if you'll have to go through the advanced disaster recovery.
>> Most of the time the DR steps seem to work well, so if you want to try to bring the FS back up quickly, that might be the way to go. It might suffice to reset the journal, here's an excerpt from a recent thread (https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/N4MUFBEATLRRMQ45IIHVINOGNWDIEUGW/#ACTEN6ZHZRSGWGFNE47XRW4Y5OP5URVY):
>>
>> ---snip---
>> I was able to recover MDS by following the below steps.
>>
>> cephfs-journal-tool journal export backup.bin
>> cephfs-journal-tool event recover_dentries summary
>> cephfs-journal-tool [--rank=<fs_name>:{mds-rank|all}] journal reset --yes-i-really-really-mean-it
>> cephfs-table-tool all reset session
>> ...
>> I ran additional commands to bring mds live and healthy.
>> "ceph fs reset-mds filesystem --yes-i-really-mean-it"
>> "ceph mds fail filesystem:0"
>>
>> Now the cluster is healthy.
>> ---snip---
>>
>> But again, it might be caused by something else than in the mentioned thread, so there's no guarantee this will work. If there's enough space in the cluster, and a metadata pool is usually not too large, I would also suggest copying the metadata pool to a backup pool as well. Just to be safe(r).
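On copying the metadata pool somewhere safe before any destructive steps: one way to do this is to serialize the pool with `rados export` and keep the dump as a backup. A sketch only; the pool and file names are placeholders and the commands are not quoted from the thread:

# Dump all objects of the CephFS metadata pool to a local file
rados --cluster <mycluster> -p <cephfs_metadata_pool> export metadata-pool-backup.bin

# If ever needed, the objects can later be restored into a pool with:
# rados --cluster <mycluster> -p <target_pool> import metadata-pool-backup.bin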
>>
>> Zitat von Kasper Rasmussen <kasper_steenga...@hotmail.com>:
>>
>>> Both commands
>>>
>>> cephfs-journal-tool --rank=<NAME-REDACTED>:all --journal=mdlog journal inspect
>>> cephfs-journal-tool --rank=<NAME-REDACTED>:all --journal=purge_queue journal inspect
>>>
>>> return:
>>>
>>> Overall journal integrity: OK
>>>
>>> ________________________________
>>> From: Kasper Rasmussen <kasper_steenga...@hotmail.com>
>>> Sent: Tuesday, May 20, 2025 09:48
>>> To: Eugen Block <ebl...@nde.ag>; Alexander Patrakov <patra...@gmail.com>
>>> Cc: ceph-users@ceph.io <ceph-users@ceph.io>
>>> Subject: [ceph-users] Re: MDS Repeatedly Crashing/Restarting - Unable to get CephFS Active
>>>
>>> One more finding - don't know if it's related, but the amount of concurrent scrub/deep-scrub operations has risen a lot since the upgrade from Pacific to Reef, it seems.
>>>
>>> From a usual 200-300 to now 800-1000.
>>>
>>> ________________________________
>>> From: Kasper Rasmussen <kasper_steenga...@hotmail.com>
>>> Sent: Tuesday, May 20, 2025 09:16
>>> To: Eugen Block <ebl...@nde.ag>; Alexander Patrakov <patra...@gmail.com>
>>> Cc: ceph-users@ceph.io <ceph-users@ceph.io>
>>> Subject: [ceph-users] Re: MDS Repeatedly Crashing/Restarting - Unable to get CephFS Active
>>>
>>> I can also add the following -
>>>
>>> It's not an OOM issue.
>>> The MDS host has 256 GB RAM
>>> mds_cache_memory_limit is just below 39 GB
>>>
>>> No oom-killer in the logs.
>>>
>>> ceph fs dump returns:
>>>
>>> Filesystem '<NAME-REDACTED>' (2)
>>> fs_name <NAME-REDACTED>
>>> epoch 4402521
>>> flags 12 joinable allow_snaps allow_multimds_snaps
>>> created 2022-03-21T08:22:52.262710+0000
>>> modified 2025-05-19T16:05:17.384954+0000
>>> tableserver 0
>>> root 0
>>> session_timeout 60
>>> session_autoclose 600
>>> max_file_size 4398046511104
>>> max_xattr_size 65536
>>> required_client_features {}
>>> last_failure 0
>>> last_failure_osd_epoch 2046148
>>> compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
>>> max_mds 1
>>> in 0
>>> up {0=2125835186}
>>> failed
>>> damaged
>>> stopped
>>> data_pools [45]
>>> metadata_pool 44
>>> inline_data disabled
>>> balancer
>>> bal_rank_mask -1
>>> standby_count_wanted 1
>>>
>>>
>>> So based on the Failed States description - https://docs.ceph.com/en/reef/cephfs/mds-states/#failed-states
>>>
>>> I understand that the MDS at least did not report damaged/failed metadata - but that might be an optimistic interpretation?
>>>
>>> ________________________________
>>> From: Kasper Rasmussen <kasper_steenga...@hotmail.com>
>>> Sent: Tuesday, May 20, 2025 09:02
>>> To: Eugen Block <ebl...@nde.ag>; Alexander Patrakov <patra...@gmail.com>
>>> Cc: ceph-users@ceph.io <ceph-users@ceph.io>
>>> Subject: Re: [ceph-users] Re: MDS Repeatedly Crashing/Restarting - Unable to get CephFS Active
>>>
>>> I haven't tried any disaster recovery yet. However, I've found this bug that looks like the issue.
>>>
>>> https://tracker.ceph.com/issues/61009
>>>
>>> Seems like it's still open, and might have gone stale - can anyone comment on that in this channel?
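Regarding the failed-states question a couple of messages up: whether the MDS has actually recorded damaged metadata can also be checked directly, at least while a rank is up and responsive. A minimal sketch with the standard commands; the daemon name is a placeholder:

# List any metadata damage the rank has recorded (empty output means nothing was reported)
ceph --cluster <mycluster> tell mds.<mdsname> damage ls

# Cluster-wide health flags such as MDS_DAMAGE or MDS_ALL_DOWN show up here
ceph --cluster <mycluster> health detail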
>>>
>>> ________________________________
>>> From: Eugen Block <ebl...@nde.ag>
>>> Sent: Tuesday, May 20, 2025 08:01
>>> To: Alexander Patrakov <patra...@gmail.com>
>>> Cc: ceph-users@ceph.io <ceph-users@ceph.io>
>>> Subject: [ceph-users] Re: MDS Repeatedly Crashing/Restarting - Unable to get CephFS Active
>>>
>>> Hi,
>>>
>>> I don't think I've had to use a journal backup yet. Either the backup of the journal failed because it was corrupted, or the disaster recovery procedure worked out.
>>> But assume that you would need to import the backup:
>>>
>>> cephfs-journal-tool [options] journal import <path> [--force]
>>>
>>> and then retry to recover the FS. But I can't remember either if anyone on this list has reported successfully restoring the journal from backup and then successfully recovering the FS in a second attempt.
>>>
>>>
>>> Zitat von Alexander Patrakov <patra...@gmail.com>:
>>>
>>>> Hi Eugen,
>>>>
>>>> I have never seen any instructions on how to use such a backup if disaster recovery fails. Do you know the procedure?
>>>>
>>>> On Tue, May 20, 2025 at 1:23 AM Eugen Block <ebl...@nde.ag> wrote:
>>>>> Hi,
>>>>>
>>>>> not sure if it was related to journal replay, but have you checked for memory issues? What's the mds memory target? Any traces of an oom killer?
>>>>>
>>>>> Next I would do is inspect the journals for both purge_queue and md_log:
>>>>>
>>>>> cephfs-journal-tool journal inspect --rank=<cephfs> --journal=md_log
>>>>> cephfs-journal-tool journal inspect --rank=<cephfs> --journal=purge_queue
>>>>>
>>>>> The --rank and --journal parameters might be in the wrong place here, I'm writing this without immediate access to a cephfs-journal-tool.
>>>>>
>>>>> In case the journals are okay, create a backup as described in the docs [0]. Then you might have to go through the disaster recovery steps (for this cephfs only).
>>>>>
>>>>> [0] https://docs.ceph.com/en/latest/cephfs/disaster-recovery/
>>>>>
>>>>> Zitat von Kasper Rasmussen <kasper_steenga...@hotmail.com>:
>>>>>
>>>>>> Ceph Version: 18.2.7
>>>>>>
>>>>>> I've just migrated to cephadm, and upgraded from Pacific to Reef 18.2.7 last week.
>>>>>> All successful except some minor issues with BlueFS spillover.
>>>>>>
>>>>>> Today the MDS of a specific fs refuses to start, and ceph orch ps shows the daemons with status "error".
>>>>>> I have three other cephfs that still work (though I haven't tested if they can fail over).
>>>>>>
>>>>>> I've restarted the MDSs - no luck (the selected MDS just starts/crashes in a loop until it gives up).
>>>>>> I've deployed 2 new MDSs - no luck, same issue.
>>>>>>
>>>>>> In all scenarios I see in ceph fs status that an MDS is chosen. FS status goes to "replay" or "replay(laggy)".
>>>>>> On the host with the MDS I see the MDS container just crashes after way less than 5 mins, and the status reported by ceph orch ps is error.
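For pulling crash details like the excerpt below out of a crash-looping, cephadm-managed MDS, the standard crash and log commands can help; a sketch only, with cluster and daemon names as placeholders, not commands taken from the thread:

# See which MDS daemons are in error state and on which hosts
ceph --cluster <mycluster> orch ps --daemon-type mds

# Recent crashes collected by the crash module, and the full backtrace of one of them
ceph --cluster <mycluster> crash ls
ceph --cluster <mycluster> crash info <crash_id>

# On the host running the daemon, the container logs (journald) hold the same output
cephadm logs --name <mds_daemon_name>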
>>>>>>
>>>>>> (btw - mds_beacon_grace has been set to 360)
>>>>>>
>>>>>> I've managed to get a good 500 lines of log out with info like this:
>>>>>>
>>>>>> << ----------------- LOG EXAMPLE START ----------------- >>
>>>>>> -7> 2025-05-19T16:05:02.840+0000 7f6739bb8640 10 monclient: _check_auth_tickets
>>>>>> -6> 2025-05-19T16:05:02.840+0000 7f6739bb8640 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2025-05-19T16:04:32.845551+0000)
>>>>>> -5> 2025-05-19T16:05:02.860+0000 7f673e3c1640 10 monclient: get_auth_request con 0x5616e9616c00 auth_method 0
>>>>>> -4> 2025-05-19T16:05:02.916+0000 7f673dbc0640 10 monclient: get_auth_request con 0x5616e7422800 auth_method 0
>>>>>> -3> 2025-05-19T16:05:02.968+0000 7f673d3bf640 10 monclient: get_auth_request con 0x5616f5eac800 auth_method 0
>>>>>> -2> 2025-05-19T16:05:02.972+0000 7f6736bb2640 2 mds.0.cache Memory usage: total 574800, rss 343772, heap 207124, baseline 182548, 0 / 7535 inodes have caps, 0 caps, 0 caps per inode
>>>>>> -1> 2025-05-19T16:05:03.676+0000 7f67333ab640 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.7/rpm/el9/BUILD/ceph-18.2.7/src/include/interval_set.h: In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]' thread 7f67333ab640 time 2025-05-19T16:05:03.680495+0000
>>>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.7/rpm/el9/BUILD/ceph-18.2.7/src/include/interval_set.h: 568: FAILED ceph_assert(p->first <= start)
>>>>>>
>>>>>> ceph version 18.2.7 (6b0e988052ec84cf2d4a54ff9bbbc5e720b621ad) reef (stable)
>>>>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x11e) [0x7f67406e6d2c]
>>>>>> 2: /usr/lib64/ceph/libceph-common.so.2(+0x16beeb) [0x7f67406e6eeb]
>>>>>> 3: /usr/bin/ceph-mds(+0x1f16fe) [0x5616e04d46fe]
>>>>>> 4: /usr/bin/ceph-mds(+0x1f1745) [0x5616e04d4745]
>>>>>> 5: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4bdc) [0x5616e0709a4c]
>>>>>> 6: (EUpdate::replay(MDSRank*)+0x5d) [0x5616e0711afd]
>>>>>> 7: (MDLog::_replay_thread()+0x75e) [0x5616e06bc02e]
>>>>>> 8: /usr/bin/ceph-mds(+0x1404b1) [0x5616e04234b1]
>>>>>> 9: /lib64/libc.so.6(+0x8a21a) [0x7f674009721a]
>>>>>> 10: clone()
>>>>>>
>>>>>> 0> 2025-05-19T16:05:03.676+0000 7f67333ab640 -1 *** Caught signal (Aborted) **
>>>>>> in thread 7f67333ab640 thread_name:mds-log-replay
>>>>>>
>>>>>> ceph version 18.2.7 (6b0e988052ec84cf2d4a54ff9bbbc5e720b621ad) reef (stable)
>>>>>> 1: /lib64/libc.so.6(+0x3ebf0) [0x7f674004bbf0]
>>>>>> 2: /lib64/libc.so.6(+0x8bf5c) [0x7f6740098f5c]
>>>>>> 3: raise()
>>>>>> 4: abort()
>>>>>> 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x178) [0x7f67406e6d86]
>>>>>> 6: /usr/lib64/ceph/libceph-common.so.2(+0x16beeb) [0x7f67406e6eeb]
>>>>>> 7: /usr/bin/ceph-mds(+0x1f16fe) [0x5616e04d46fe]
>>>>>> 8: /usr/bin/ceph-mds(+0x1f1745) [0x5616e04d4745]
>>>>>> 9: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4bdc) [0x5616e0709a4c]
>>>>>> 10: (EUpdate::replay(MDSRank*)+0x5d) [0x5616e0711afd]
>>>>>> 11: (MDLog::_replay_thread()+0x75e) [0x5616e06bc02e]
>>>>>> 12: /usr/bin/ceph-mds(+0x1404b1) [0x5616e04234b1]
>>>>>> 13: /lib64/libc.so.6(+0x8a21a) [0x7f674009721a]
>>>>>> 14: clone()
>>>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>>> << ----------------- LOG EXAMPLE END ----------------- >>
>>>>>>
>>>>>>
>>>>>> But to be honest, out of all those lines, I don't know what to provide (all 500+ might be a bit too much).
>>>>>>
>>>>>> I really need this FS back online, so help will be very much appreciated.
>>>>
>>>> --
>>>> Alexander Patrakov

--
Enrico Bocchi
CERN European Laboratory for Particle Physics
IT - Storage & Data Management - General Storage Services
Mailbox: G20500 - Office: 31-2-010
1211 Genève 23
Switzerland
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io