Hi Eugen, I was able to recover the MDS by following the steps below.
cephfs-journal-tool journal export backup.bin
cephfs-journal-tool event recover_dentries summary
cephfs-journal-tool [--rank=<fs_name>:{mds-rank|all}] journal reset --yes-i-really-really-mean-it
cephfs-table-tool all reset session

After running the above I saw a few errors in the MDS log:

[ERR] loaded dup inode 0x100085efb8e...
[ERR] bad backtrace on directory inode 0x100085efaa3

I then ran two additional commands to bring the MDS back up and healthy:

"ceph fs reset-mds filesystem --yes-i-really-mean-it"
"ceph mds fail filesystem:0"

Now the cluster is healthy. A consolidated sketch of the whole sequence is appended at the very end of this mail, below the quoted thread.

On Thu, Apr 17, 2025 at 8:01 PM Amudhan P <amudha...@gmail.com> wrote:

> Okay Thanks, Eugen.
>
> On Thu, Apr 17, 2025 at 6:49 PM Eugen Block <ebl...@nde.ag> wrote:
>
>> Those tools are specifically for a disaster recovery, meaning there's
>> no mds running (otherwise you wouldn't need them ;-) ).
>>
>> What would this command do ?
>> ceph fs reset-mds filesystem --yes-i-really-mean-it
>>
>> I can only quote the docs, I don't know what exactly happens in the
>> background, I guess the devs would need to answer that. But maybe you
>> don't even need it, sometimes it's sufficient to reset the journal.
>>
>> A few months ago I asked to improve the disaster recovery docs since
>> it's not really clear what exactly the described steps do and how
>> dangerous which operation is. Zac created a tracker for this, but I
>> haven't checked in a while for any progress. But most of the times (at
>> least that's my perception) the recovery seems to work.
>>
>> Zitat von Amudhan P <amudha...@gmail.com>:
>>
>> > Also, I don't have any active mds running, not sure if I can do all the
>> > recovery steps.
>> >
>> > On Thu, Apr 17, 2025 at 5:42 PM Amudhan P <amudha...@gmail.com> wrote:
>> >
>> >> I didn't face any network related issues but in my email start. I have
>> >> mentioned about one of the mon node was out of quorum.
>> >> Even after restart of mon node had few issues in log like slow ops in mon
>> >> node and eventually it cleared but same mon node went out of quorum and
>> >> comes back online. During the above said issue observed 40% cpu in wait in
>> >> top command in the same node where mon had issue. But there is no errors
>> >> logged in dmesg related to drive error or network. Only after this mon
>> >> error may be after 24 hours MDS failed. MON issue cleared now after all
>> >> nodes were reboot cycled.
>> >>
>> >> Yeah, disaster recovery will be performed only as last resort.
>> >>
>> >> What would this command do ?
>> >>
>> >> ceph fs reset-mds filesystem --yes-i-really-mean-it
>> >>
>> >> On Thu, Apr 17, 2025 at 5:19 PM Eugen Block <ebl...@nde.ag> wrote:
>> >>
>> >>> Was there any issue in your network or anything? Something that would
>> >>> explain the MDS crash? I would recommend to scan syslog, dmesg etc.
>> >>> for anything suspicious. If you don't find anything, you might need to
>> >>> go through
>> >>> https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/ to
>> >>> get your CephFS back up.
>> >>>
>> >>> Don't forget to create a backup of the journals. Read the instructions
>> >>> carefully, you might not need to do all the mentioned steps.
I would >> >>> start with (for both journals where applicable): >> >>> >> >>> cephfs-journal-tool journal export backup.bin >> >>> cephfs-journal-tool event recover_dentries summary >> >>> cephfs-journal-tool [--rank=<fs_name>:{mds-rank|all}] journal reset >> >>> --yes-i-really-really-mean-it >> >>> cephfs-table-tool all reset session >> >>> >> >>> But again, read the instructions carefully. This procedure is usually >> >>> the last resort, I would first try to find out what happened causing >> >>> this situation. >> >>> >> >>> Zitat von Amudhan P <amudha...@gmail.com>: >> >>> >> >>> > I don't think I have a memory issue. >> >>> > Sorry, for sending log as file Pastebin is not working due to size >> >>> > limitation. >> >>> > >> >>> > Crash info >> >>> > >> >>> >> ceph crash info >> >>> >> 2025-04-17T08:50:35.931485Z_0109f173-206c-471e-afac-c0d3e0aa2785 >> >>> >> { >> >>> >> "backtrace": [ >> >>> >> "/lib64/libpthread.so.0(+0x12b20) [0x7fbd42bbdb20]", >> >>> >> "/usr/lib64/ceph/libceph-common.so.2(+0x8ec7a00) >> >>> [0x7fbd4c81da00]" >> >>> >> ], >> >>> >> "ceph_version": "16.2.6", >> >>> >> "crash_id": >> >>> >> "2025-04-17T08:50:35.931485Z_0109f173-206c-471e-afac-c0d3e0aa2785", >> >>> >> "entity_name": "mds.mummasstrg.strg-node1.gchapr", >> >>> >> "os_id": "centos", >> >>> >> "os_name": "CentOS Linux", >> >>> >> "os_version": "8", >> >>> >> "os_version_id": "8", >> >>> >> "process_name": "ceph-mds", >> >>> >> "stack_sig": >> >>> >> "5238fe1be0b82b479b10ee7d17b5ad3182cdc93ede581af63d627a472a4fcf9e", >> >>> >> "timestamp": "2025-04-17T08:50:35.931485Z", >> >>> >> "utsname_hostname": "strg-node1", >> >>> >> "utsname_machine": "x86_64", >> >>> >> "utsname_release": "5.10.0-8-amd64", >> >>> >> "utsname_sysname": "Linux", >> >>> >> "utsname_version": "#1 SMP Debian 5.10.46-5 (2021-09-23)" >> >>> >> } >> >>> >> >> >>> > >> >>> > On Thu, Apr 17, 2025 at 2:55 PM Eugen Block <ebl...@nde.ag> wrote: >> >>> > >> >>> >> Oh right, I just noticed you had the status in your first message. >> Is >> >>> >> there any chance the MDS is out of memory? Can you paste a complete >> >>> >> (fresh) startup log on some pastebin or so? Also, a complete 'ceph >> >>> >> crash info <crash>' could be useful. >> >>> >> >> >>> >> Zitat von Amudhan P <amudha...@gmail.com>: >> >>> >> >> >>> >> > I have run the status and stat command below is the output. 
>> >>> >> > >> >>> >> > ceph -s >> >>> >> > >> >>> >> > cluster: >> >>> >> > id: 7b3a4952-2131-11ec-94ce-0cc47a5ec98a >> >>> >> > health: HEALTH_WARN >> >>> >> > 2 failed cephadm daemon(s) >> >>> >> > 1 filesystem is degraded >> >>> >> > insufficient standby MDS daemons available >> >>> >> > 7 daemons have recently crashed >> >>> >> > >> >>> >> > services: >> >>> >> > mon: 3 daemons, quorum strg-node1,strg-node2,strg-node3 (age >> 20h) >> >>> >> > mgr: strg-node2.unyimy(active, since 20h), standbys: >> >>> >> strg-node1.ivkfid >> >>> >> > mds: 1/1 daemons up >> >>> >> > osd: 32 osds: 32 up (since 20h), 32 in (since 10w) >> >>> >> > >> >>> >> > data: >> >>> >> > volumes: 0/1 healthy, 1 recovering >> >>> >> > pools: 3 pools, 321 pgs >> >>> >> > objects: 15.49M objects, 54 TiB >> >>> >> > usage: 109 TiB used, 66 TiB / 175 TiB avail >> >>> >> > pgs: 317 active+clean >> >>> >> > 4 active+clean+scrubbing+deep >> >>> >> > >> >>> >> > >> >>> >> > ceph mds stat >> >>> >> > mumstrg:1/1 {0=mumstrg.strg-node1.gchapr=up:replay(laggy or >> crashed)} >> >>> >> > >> >>> >> > ceph osd lspools >> >>> >> > 1 device_health_metrics >> >>> >> > 2 cephfs.mumstrg.meta >> >>> >> > 3 cephfs.mumstrg.data >> >>> >> > >> >>> >> > >> >>> >> > >> >>> >> > On Thu, Apr 17, 2025 at 10:33 AM Eugen Block <ebl...@nde.ag> >> wrote: >> >>> >> > >> >>> >> >> What’s your overall Ceph status? It says data pool 3 not found. >> >>> >> >> >> >>> >> >> Zitat von Amudhan P <amudha...@gmail.com>: >> >>> >> >> >> >>> >> >> > There are few more logs in MDS. I have highlighted few log >> lines >> >>> >> which I >> >>> >> >> am >> >>> >> >> > not sure what it is. >> >>> >> >> > >> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -79> >> >>> >> >> > 2025-04-16T14:43:59.170+0000 7f74c6ecd780 5 >> asok(0x560a2c44e000) >> >>> >> >> > register_command dump inode hook 0x560a2c354580 >> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -78> >> >>> >> >> > 2025-04-16T14:43:59.170+0000 7f74c6ecd780 5 >> asok(0x560a2c44e000) >> >>> >> >> > register_command exit hook 0x560a2c354580 >> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -77> >> >>> >> >> > 2025-04-16T14:43:59.170+0000 7f74c6ecd780 5 >> asok(0x560a2c44e000) >> >>> >> >> > register_command respawn hook 0x560a2c354580 >> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -76> >> >>> >> >> > 2025-04-16T14:43:59.170+0000 7f74c6ecd780 5 >> asok(0x560a2c44e000) >> >>> >> >> > register_command heap hook 0x560a2c354580 >> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -75> >> >>> >> >> > 2025-04-16T14:43:59.170+0000 7f74b5030700 1 >> >>> >> >> mds.mumstrg.strg-node3.xhxbwx >> >>> >> >> > Updating MDS map to version 127517 f >> >>> >> >> > rom mon.2 >> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -74> >> >>> >> >> > 2025-04-16T14:43:59.170+0000 7f74c6ecd780 5 >> asok(0x560a2c44e000) >> >>> >> >> > register_command cpu_profiler hook 0x560a2c35458 >> >>> >> >> > >> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -73> >> >>> >> >> > 2025-04-16T14:43:59.170+0000 7f74b302c700 5 >> >>> >> >> > mds.beacon.mumstrg.strg-node3.xhxbwx Sending beacon up:boot >> seq 1 >> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -72> >> >>> >> >> > 2025-04-16T14:43:59.170+0000 7f74b302c700 10 monclient: >> >>> >> _send_mon_message >> >>> >> >> > to mon.strg-node3 at v2:10.0.103.3:3300/ >> >>> >> >> > >> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -71> >> >>> >> >> > 2025-04-16T14:43:59.254+0000 7f74b5030700 1 >> >>> >> >> 
mds.mumstrg.strg-node3.xhxbwx >> >>> >> >> > Updating MDS map to version 127518 f >> >>> >> >> > rom mon.2 >> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -70> >> >>> >> >> > 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: >> >>> _renew_subs >> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -69> >> >>> >> >> > 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: >> >>> >> _send_mon_message >> >>> >> >> > to mon.strg-node3 at v2:10.0.103.3:3300/ >> >>> >> >> > >> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -68> >> >>> >> >> > 2025-04-16T14:43:59.254+0000 7f74b5030700 4 mds.0.purge_queue >> >>> >> >> operator(): >> >>> >> >> > data pool 3 not found in OSDMap >> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -67> >> >>> >> >> > 2025-04-16T14:43:59.254+0000 7f74b5030700 5 >> asok(0x560a2c44e000) >> >>> >> >> > register_command objecter_requests hook 0x560a2c >> >>> >> >> > 3544c0 >> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -66> >> >>> >> >> > 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: >> >>> _renew_subs >> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -65> >> >>> >> >> > 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: >> >>> >> _send_mon_message >> >>> >> >> > to mon.strg-node3 at v2:10.0.103.3:3300/ >> >>> >> >> > >> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -64> >> >>> >> >> > 2025-04-16T14:43:59.254+0000 7f74b5030700 10 >> log_channel(cluster) >> >>> >> >> > update_config to_monitors: true to_syslog: false >> >>> >> >> > syslog_facility: daemon prio: info to_graylog: false >> >>> graylog_host: >> >>> >> >> > 127.0.0.1 graylog_port: 12201) >> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -63> >> >>> >> >> > 2025-04-16T14:43:59.254+0000 7f74b5030700 4 mds.0.purge_queue >> >>> >> >> operator(): >> >>> >> >> > data pool 3 not found in OSDMap >> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -62> >> >>> >> >> > 2025-04-16T14:43:59.254+0000 7f74b5030700 4 mds.0.0 >> >>> handle_osd_map >> >>> >> epoch >> >>> >> >> > 0, 0 new blocklist entries >> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -61> >> >>> >> >> > 2025-04-16T14:43:59.254+0000 7f74b5030700 1 mds.0.127518 >> >>> >> handle_mds_map >> >>> >> >> i >> >>> >> >> > am now mds.0.127518 >> >>> >> >> > >> >>> >> >> >> Apr 16 20:13:59 strg-node3 bash[59114]: debug -60> >> >>> >> >> >> 2025-04-16T14:43:59.254+0000 7f74b5030700 1 mds.0.127518 >> >>> >> handle_mds_map >> >>> >> >> >> state change up:boot --> up:replay >> >>> >> >> >> Apr 16 20:13:59 strg-node3 bash[59114]: debug -59> >> >>> >> >> >> 2025-04-16T14:43:59.254+0000 7f74b5030700 5 >> >>> >> >> >> mds.beacon.mummasstrg.strg-node3.xhxbwx set_want_state: >> up:boot >> >>> -> >> >>> >> >> up:replay >> >>> >> >> >> Apr 16 20:13:59 strg-node3 bash[59114]: debug -58> >> >>> >> >> >> 2025-04-16T14:43:59.254+0000 7f74b5030700 1 mds.0.127518 >> >>> >> replay_start >> >>> >> >> >> *Apr 16 20:13:59 strg-node3 bash[59114]: debug -57> >> >>> >> >> >> 2025-04-16T14:43:59.254+0000 7f74b5030700 1 mds.0.127518 >> >>> waiting >> >>> >> for >> >>> >> >> >> osdmap 45749 (which blocklists prior instance)* >> >>> >> >> >> Apr 16 20:13:59 strg-node3 bash[59114]: debug -56> >> >>> >> >> >> 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: >> >>> >> >> _send_mon_message >> >>> >> >> >> to mon.strg-node3 at v2:10.0.103.3:3300/0 >> >>> >> >> >> *Apr 16 20:13:59 strg-node3 bash[59114]: debug -55> >> >>> >> >> >> 2025-04-16T14:43:59.254+0000 
7f74b5030700 4 >> mds.0.purge_queue >> >>> >> >> operator(): >> >>> >> >> >> data pool 3 not found in OSDMap* >> >>> >> >> >> >> >>> >> >> > >> >>> >> >> > >> >>> >> >> > On Thu, Apr 17, 2025 at 7:06 AM Amudhan P < >> amudha...@gmail.com> >> >>> >> wrote: >> >>> >> >> > >> >>> >> >> >> Eugen, >> >>> >> >> >> >> >>> >> >> >> This is the output for the command >> >>> >> >> >> cephfs-journal-tool --rank=mumstrg:all --journal=purge_queue >> >>> journal >> >>> >> >> >> inspect >> >>> >> >> >> Overall journal integrity: OK >> >>> >> >> >> cephfs-journal-tool --rank=mumstrg:all --journal=mdlog >> journal >> >>> >> inspect >> >>> >> >> >> Overall journal integrity: OK >> >>> >> >> >> >> >>> >> >> >> On Thu, Apr 17, 2025 at 2:59 AM Eugen Block <ebl...@nde.ag> >> >>> wrote: >> >>> >> >> >> >> >>> >> >> >>> I think either your mdlog or the purge_queue journal is >> >>> corrupted: >> >>> >> >> >>> >> >>> >> >> >>> 2025-04-16T09:59:30.146+0000 7f43cf872700 2 mds.0.127506 >> >>> Booting: >> >>> >> 2: >> >>> >> >> >>> waiting for purge queue recovered >> >>> >> >> >>> Apr 16 15:29:30 strg-node4 bash[7566]: debug -1> >> >>> >> >> >>> 2025-04-16T09:59:30.146+0000 7f43d9085700 10 monclient: >> >>> >> >> get_auth_request >> >>> >> >> >>> con 0x562856a25400 auth_method 0 >> >>> >> >> >>> Apr 16 15:29:30 strg-node4 bash[7566]: debug 0> >> >>> >> >> >>> 2025-04-16T09:59:30.230+0000 7f43ce06f700 -1 *** Caught >> signal >> >>> >> >> >>> (Segmentation fault) ** >> >>> >> >> >>> Apr 16 15:29:30 strg-node4 bash[7566]: in thread >> 7f43ce06f700 >> >>> >> >> >>> thread_name:md_log_replay >> >>> >> >> >>> >> >>> >> >> >>> Can you paste the output of this command? >> >>> >> >> >>> >> >>> >> >> >>> cephfs-journal-tool --rank={YOUR_CEPH_FS}:all >> >>> --journal=purge_queue >> >>> >> >> >>> journal inspect >> >>> >> >> >>> cephfs-journal-tool --rank={YOUR_CEPH_FS}:all >> --journal=mdlog >> >>> >> journal >> >>> >> >> >>> inspect >> >>> >> >> >>> >> >>> >> >> >>> I expect one or more damaged entries. Check this thread for >> more >> >>> >> >> details: >> >>> >> >> >>> >> >>> >> >> >>> https://www.spinics.net/lists/ceph-users/msg80124.html >> >>> >> >> >>> >> >>> >> >> >>> You should try to backup the journal, but in my case that >> wasn't >> >>> >> >> >>> possible, so I had no other choice than resetting it. >> >>> >> >> >>> >> >>> >> >> >>> Zitat von Amudhan P <amudha...@gmail.com>: >> >>> >> >> >>> >> >>> >> >> >>> > Hi, >> >>> >> >> >>> > >> >>> >> >> >>> > I am having 2 problems with my Ceph version 16.2.6 >> >>> >> >> >>> > (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific >> (stable) >> >>> >> deployed >> >>> >> >> >>> thru >> >>> >> >> >>> > cephadm. >> >>> >> >> >>> > >> >>> >> >> >>> > First issue :- >> >>> >> >> >>> > 1 out 3 mon service went out of quorum . >> >>> >> >> >>> > When restarted service it comes normal but after a few >> >>> minutes in >> >>> >> >> ceph >> >>> >> >> >>> > watch log it reports slow ops and mon goes out of quorum. >> >>> >> >> >>> > Node where this mon service failed had one weird thing >> that I >> >>> >> could >> >>> >> >> see >> >>> >> >> >>> 40% >> >>> >> >> >>> > of wait in the top command. But I don't see any error in >> >>> dmesg or >> >>> >> >> >>> anything >> >>> >> >> >>> > related to drive IO error. >> >>> >> >> >>> > Below are the logs that were printed in ceph watch >> command. 
>> >>> >> >> >>> > >> >>> >> >> >>> > 2025-04-16T09:30:00.000393+0530 mon.strg-node2 [WRN] [WRN] >> >>> >> MON_DOWN: >> >>> >> >> 1/3 >> >>> >> >> >>> > mons down, quorum strg-node2,strg-node3 >> >>> >> >> >>> > 2025-04-16T09:30:00.000416+0530 mon.strg-node2 [WRN] >> >>> >> >> mon.strg-node1 >> >>> >> >> >>> > (rank 0) addr [v2:10.0.103.1:3300/0,v1:10.0.103.1:6789/0] >> is >> >>> down >> >>> >> >> (out >> >>> >> >> >>> of >> >>> >> >> >>> > quorum) >> >>> >> >> >>> > >> >>> >> >> >>> > For now this is not appearing again. >> >>> >> >> >>> > >> >>> >> >> >>> > >> >>> >> >> >>> > Second issue Cephfs degraded :- >> >>> >> >> >>> > I have 2 MDS services running in 2 different nodes. Both >> are >> >>> in a >> >>> >> >> >>> stopped >> >>> >> >> >>> > state. >> >>> >> >> >>> > when running Ceph -s command >> >>> >> >> >>> > >> >>> >> >> >>> > cluster: >> >>> >> >> >>> > id: 7b3a4952-2131-11ec-94ce-0cc47a5ec98a >> >>> >> >> >>> > health: HEALTH_WARN >> >>> >> >> >>> > 2 failed cephadm daemon(s) >> >>> >> >> >>> > 1 filesystem is degraded >> >>> >> >> >>> > insufficient standby MDS daemons available >> >>> >> >> >>> > >> >>> >> >> >>> > services: >> >>> >> >> >>> > mon: 3 daemons, quorum >> strg-node1,strg-node2,strg-node3 >> >>> (age >> >>> >> 4h) >> >>> >> >> >>> > mgr: strg-node2.unyimy(active, since 4h), standbys: >> >>> >> >> >>> strg-node1.ivkfid >> >>> >> >> >>> > mds: 1/1 daemons up >> >>> >> >> >>> > osd: 32 osds: 32 up (since 4h), 32 in (since 10w) >> >>> >> >> >>> > >> >>> >> >> >>> > data: >> >>> >> >> >>> > volumes: 0/1 healthy, 1 recovering >> >>> >> >> >>> > pools: 3 pools, 321 pgs >> >>> >> >> >>> > objects: 15.49M objects, 54 TiB >> >>> >> >> >>> > usage: 109 TiB used, 66 TiB / 175 TiB avail >> >>> >> >> >>> > pgs: 321 active+clean >> >>> >> >> >>> > >> >>> >> >> >>> > Volume shows recovering but there wasn't any progress >> till now >> >>> >> even >> >>> >> >> >>> manual >> >>> >> >> >>> > start mds service fails again. In Ceph -s command under >> >>> services >> >>> >> it >> >>> >> >> >>> shows >> >>> >> >> >>> > mds up no any mds service is running. >> >>> >> >> >>> > >> >>> >> >> >>> > Below is a log snip from one of the mds service. 
>> >>> >> >> >>> > >> >>> >> >> >>> > >> >>> >> >> >>> > -25> 2025-04-16T09:59:29.954+0000 >> 7f43d0874700 1 >> >>> >> >> >>> > mds.0.journaler.pq(ro) _finish_read_head loghead(trim >> >>> 13967032320, >> >>> >> >> ex> >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -24> >> >>> >> >> >>> > 2025-04-16T09:59:29.954+0000 7f43d0874700 1 >> >>> >> mds.0.journaler.pq(ro) >> >>> >> >> >>> probing >> >>> >> >> >>> > for end of the log >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -23> >> >>> >> >> >>> > 2025-04-16T09:59:29.954+0000 7f43d9085700 10 monclient: >> >>> >> >> get_auth_request >> >>> >> >> >>> > con 0x562856a17400 auth_method 0 >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -22> >> >>> >> >> >>> > 2025-04-16T09:59:29.954+0000 7f43d8884700 10 monclient: >> >>> >> >> get_auth_request >> >>> >> >> >>> > con 0x562856a17c00 auth_method 0 >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -21> >> >>> >> >> >>> > 2025-04-16T09:59:29.974+0000 7f43cf071700 1 >> >>> >> >> mds.0.journaler.mdlog(ro) >> >>> >> >> >>> > recover start >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -20> >> >>> >> >> >>> > 2025-04-16T09:59:29.974+0000 7f43cf071700 1 >> >>> >> >> mds.0.journaler.mdlog(ro) >> >>> >> >> >>> > read_head >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -19> >> >>> >> >> >>> > 2025-04-16T09:59:29.974+0000 7f43cf071700 4 mds.0.log >> >>> Waiting for >> >>> >> >> >>> journal >> >>> >> >> >>> > 0x200 to recover... >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -18> >> >>> >> >> >>> > 2025-04-16T09:59:29.974+0000 7f43d8083700 10 monclient: >> >>> >> >> get_auth_request >> >>> >> >> >>> > con 0x562856a25000 auth_method 0 >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -17> >> >>> >> >> >>> > 2025-04-16T09:59:29.998+0000 7f43d0874700 1 >> >>> >> mds.0.journaler.pq(ro) >> >>> >> >> >>> > _finish_probe_end write_pos = 13968309289 (hea> >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -16> >> >>> >> >> >>> > 2025-04-16T09:59:29.998+0000 7f43d0874700 4 >> mds.0.purge_queue >> >>> >> >> >>> operator(): >> >>> >> >> >>> > open complete >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -15> >> >>> >> >> >>> > 2025-04-16T09:59:29.998+0000 7f43d0874700 1 >> >>> >> mds.0.journaler.pq(ro) >> >>> >> >> >>> > set_writeable >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -14> >> >>> >> >> >>> > 2025-04-16T09:59:29.998+0000 7f43cf872700 1 >> >>> >> >> mds.0.journaler.mdlog(ro) >> >>> >> >> >>> > _finish_read_head loghead(trim 189741504921> >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -13> >> >>> >> >> >>> > 2025-04-16T09:59:29.998+0000 7f43cf872700 1 >> >>> >> >> mds.0.journaler.mdlog(ro) >> >>> >> >> >>> > probing for end of the log >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -12> >> >>> >> >> >>> > 2025-04-16T09:59:30.002+0000 7f43d9085700 10 monclient: >> >>> >> >> get_auth_request >> >>> >> >> >>> > con 0x562856a25c00 auth_method 0 >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -11> >> >>> >> >> >>> > 2025-04-16T09:59:30.098+0000 7f43cf872700 1 >> >>> >> >> mds.0.journaler.mdlog(ro) >> >>> >> >> >>> > _finish_probe_end write_pos = 1897428915052> >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -10> >> >>> >> >> >>> > 2025-04-16T09:59:30.098+0000 7f43cf071700 4 mds.0.log >> Journal >> >>> >> 0x200 >> >>> >> >> >>> > recovered. 
>> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -9> >> >>> >> >> >>> > 2025-04-16T09:59:30.098+0000 7f43cf071700 4 mds.0.log >> >>> Recovered >> >>> >> >> journal >> >>> >> >> >>> > 0x200 in format 1 >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -8> >> >>> >> >> >>> > 2025-04-16T09:59:30.098+0000 7f43cf071700 2 mds.0.127506 >> >>> >> Booting: 1: >> >>> >> >> >>> > loading/discovering base inodes >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -7> >> >>> >> >> >>> > 2025-04-16T09:59:30.098+0000 7f43cf071700 0 mds.0.cache >> >>> creating >> >>> >> >> system >> >>> >> >> >>> > inode with ino:0x100 >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -6> >> >>> >> >> >>> > 2025-04-16T09:59:30.098+0000 7f43cf071700 0 mds.0.cache >> >>> creating >> >>> >> >> system >> >>> >> >> >>> > inode with ino:0x1 >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -5> >> >>> >> >> >>> > 2025-04-16T09:59:30.098+0000 7f43d8884700 10 monclient: >> >>> >> >> get_auth_request >> >>> >> >> >>> > con 0x562856a25800 auth_method 0 >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -4> >> >>> >> >> >>> > 2025-04-16T09:59:30.098+0000 7f43d8083700 10 monclient: >> >>> >> >> get_auth_request >> >>> >> >> >>> > con 0x562856a5dc00 auth_method 0 >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -3> >> >>> >> >> >>> > 2025-04-16T09:59:30.146+0000 7f43cf872700 2 mds.0.127506 >> >>> >> Booting: 2: >> >>> >> >> >>> > replaying mds log >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -2> >> >>> >> >> >>> > 2025-04-16T09:59:30.146+0000 7f43cf872700 2 mds.0.127506 >> >>> >> Booting: 2: >> >>> >> >> >>> > waiting for purge queue recovered >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -1> >> >>> >> >> >>> > 2025-04-16T09:59:30.146+0000 7f43d9085700 10 monclient: >> >>> >> >> get_auth_request >> >>> >> >> >>> > con 0x562856a25400 auth_method 0 >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug 0> >> >>> >> >> >>> > 2025-04-16T09:59:30.230+0000 7f43ce06f700 -1 *** Caught >> signal >> >>> >> >> >>> > (Segmentation fault) ** >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: in thread >> 7f43ce06f700 >> >>> >> >> >>> > thread_name:md_log_replay >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: ceph version >> 16.2.6 >> >>> >> >> >>> > (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific >> (stable) >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: 1: >> >>> >> >> >>> /lib64/libpthread.so.0(+0x12b20) >> >>> >> >> >>> > [0x7f43dd293b20] >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: 2: >> >>> >> >> >>> > /usr/lib64/ceph/libceph-common.so.2(+0x8ec7a00) >> >>> [0x7f43e6ef3a00] >> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: NOTE: a copy of >> the >> >>> >> >> executable, >> >>> >> >> >>> or >> >>> >> >> >>> > `objdump -rdS <executable>` is needed to interpret this. >> >>> >> >> >>> > >> >>> >> >> >>> > >> >>> >> >> >>> > Not sure what caused the issue. I couldn't find any >> resources >> >>> to >> >>> >> fix >> >>> >> >> >>> this >> >>> >> >> >>> > issue. >> >>> >> >> >>> > Need help from someone to bring the ceph cluster online. 
>> >>> >> >> >>> > _______________________________________________ >> >>> >> >> >>> > ceph-users mailing list -- ceph-users@ceph.io >> >>> >> >> >>> > To unsubscribe send an email to ceph-users-le...@ceph.io >> >>> >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> _______________________________________________ >> >>> >> >> >>> ceph-users mailing list -- ceph-users@ceph.io >> >>> >> >> >>> To unsubscribe send an email to ceph-users-le...@ceph.io >> >>> >> >> >>> >> >>> >> >> >> >> >>> >> >> >> >>> >> >> >> >>> >> >> >> >>> >> >> >> >>> >> >> >>> >> >> >>> >> >> >>> >> >> >>> >> >>> >> >>> >> >>> >> >> >> >> _______________________________________________ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
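
P.S. for anyone finding this thread later: here is the same procedure written out as one
consolidated sketch. It only restates what was discussed above, with the rank spelled out.
"<fs_name>" is a placeholder for your filesystem name (mine shows up as "mumstrg" in the
output above), the last scrub line is a pointer I took from the disaster-recovery docs and
did not run myself, and everything past the export step modifies on-disk metadata, so read
https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/ first.

# 1. Check both journals before touching anything
cephfs-journal-tool --rank=<fs_name>:all --journal=mdlog journal inspect
cephfs-journal-tool --rank=<fs_name>:all --journal=purge_queue journal inspect

# 2. Back up the journal (per rank, where applicable)
cephfs-journal-tool --rank=<fs_name>:0 journal export backup.bin

# 3. Recover what can be recovered, then reset the journal and the session table
#    (drop the --yes-i-really-really-mean-it guard if your release does not accept it)
cephfs-journal-tool --rank=<fs_name>:0 event recover_dentries summary
cephfs-journal-tool --rank=<fs_name>:0 journal reset --yes-i-really-really-mean-it
cephfs-table-tool <fs_name>:all reset session

# 4. What I then ran, verbatim, to get an active MDS again
ceph fs reset-mds filesystem --yes-i-really-mean-it
ceph mds fail filesystem:0

# 5. Possible follow-up for the "dup inode" / "bad backtrace" errors, as I read the docs
#    (I did not need this; check "ceph -s" and the MDS log first)
ceph tell mds.<fs_name>:0 scrub start / recursive,repair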