There are a few more logs from the MDS below. I have highlighted the log lines that I am not sure about (there are also a few command sketches at the end of this mail).
Apr 16 20:13:59 strg-node3 bash[59114]: debug -79> 2025-04-16T14:43:59.170+0000 7f74c6ecd780 5 asok(0x560a2c44e000) register_command dump inode hook 0x560a2c354580
Apr 16 20:13:59 strg-node3 bash[59114]: debug -78> 2025-04-16T14:43:59.170+0000 7f74c6ecd780 5 asok(0x560a2c44e000) register_command exit hook 0x560a2c354580
Apr 16 20:13:59 strg-node3 bash[59114]: debug -77> 2025-04-16T14:43:59.170+0000 7f74c6ecd780 5 asok(0x560a2c44e000) register_command respawn hook 0x560a2c354580
Apr 16 20:13:59 strg-node3 bash[59114]: debug -76> 2025-04-16T14:43:59.170+0000 7f74c6ecd780 5 asok(0x560a2c44e000) register_command heap hook 0x560a2c354580
Apr 16 20:13:59 strg-node3 bash[59114]: debug -75> 2025-04-16T14:43:59.170+0000 7f74b5030700 1 mds.mumstrg.strg-node3.xhxbwx Updating MDS map to version 127517 from mon.2
Apr 16 20:13:59 strg-node3 bash[59114]: debug -74> 2025-04-16T14:43:59.170+0000 7f74c6ecd780 5 asok(0x560a2c44e000) register_command cpu_profiler hook 0x560a2c354580
Apr 16 20:13:59 strg-node3 bash[59114]: debug -73> 2025-04-16T14:43:59.170+0000 7f74b302c700 5 mds.beacon.mumstrg.strg-node3.xhxbwx Sending beacon up:boot seq 1
Apr 16 20:13:59 strg-node3 bash[59114]: debug -72> 2025-04-16T14:43:59.170+0000 7f74b302c700 10 monclient: _send_mon_message to mon.strg-node3 at v2:10.0.103.3:3300/0
Apr 16 20:13:59 strg-node3 bash[59114]: debug -71> 2025-04-16T14:43:59.254+0000 7f74b5030700 1 mds.mumstrg.strg-node3.xhxbwx Updating MDS map to version 127518 from mon.2
Apr 16 20:13:59 strg-node3 bash[59114]: debug -70> 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: _renew_subs
Apr 16 20:13:59 strg-node3 bash[59114]: debug -69> 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: _send_mon_message to mon.strg-node3 at v2:10.0.103.3:3300/0
Apr 16 20:13:59 strg-node3 bash[59114]: debug -68> 2025-04-16T14:43:59.254+0000 7f74b5030700 4 mds.0.purge_queue operator(): data pool 3 not found in OSDMap
Apr 16 20:13:59 strg-node3 bash[59114]: debug -67> 2025-04-16T14:43:59.254+0000 7f74b5030700 5 asok(0x560a2c44e000) register_command objecter_requests hook 0x560a2c3544c0
Apr 16 20:13:59 strg-node3 bash[59114]: debug -66> 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: _renew_subs
Apr 16 20:13:59 strg-node3 bash[59114]: debug -65> 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: _send_mon_message to mon.strg-node3 at v2:10.0.103.3:3300/0
Apr 16 20:13:59 strg-node3 bash[59114]: debug -64> 2025-04-16T14:43:59.254+0000 7f74b5030700 10 log_channel(cluster) update_config to_monitors: true to_syslog: false syslog_facility: daemon prio: info to_graylog: false graylog_host: 127.0.0.1 graylog_port: 12201)
Apr 16 20:13:59 strg-node3 bash[59114]: debug -63> 2025-04-16T14:43:59.254+0000 7f74b5030700 4 mds.0.purge_queue operator(): data pool 3 not found in OSDMap
Apr 16 20:13:59 strg-node3 bash[59114]: debug -62> 2025-04-16T14:43:59.254+0000 7f74b5030700 4 mds.0.0 handle_osd_map epoch 0, 0 new blocklist entries
Apr 16 20:13:59 strg-node3 bash[59114]: debug -61> 2025-04-16T14:43:59.254+0000 7f74b5030700 1 mds.0.127518 handle_mds_map i am now mds.0.127518
Apr 16 20:13:59 strg-node3 bash[59114]: debug -60> 2025-04-16T14:43:59.254+0000 7f74b5030700 1 mds.0.127518 handle_mds_map state change up:boot --> up:replay
Apr 16 20:13:59 strg-node3 bash[59114]: debug -59> 2025-04-16T14:43:59.254+0000 7f74b5030700 5 mds.beacon.mummasstrg.strg-node3.xhxbwx set_want_state: up:boot -> up:replay
Apr 16 20:13:59 strg-node3 bash[59114]: debug -58> 2025-04-16T14:43:59.254+0000 7f74b5030700 1 mds.0.127518 replay_start
*Apr 16 20:13:59 strg-node3 bash[59114]: debug -57> 2025-04-16T14:43:59.254+0000 7f74b5030700 1 mds.0.127518 waiting for osdmap 45749 (which blocklists prior instance)*
Apr 16 20:13:59 strg-node3 bash[59114]: debug -56> 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: _send_mon_message to mon.strg-node3 at v2:10.0.103.3:3300/0
*Apr 16 20:13:59 strg-node3 bash[59114]: debug -55> 2025-04-16T14:43:59.254+0000 7f74b5030700 4 mds.0.purge_queue operator(): data pool 3 not found in OSDMap*

On Thu, Apr 17, 2025 at 7:06 AM Amudhan P <amudha...@gmail.com> wrote:

> Eugen,
>
> This is the output of the commands:
>
> cephfs-journal-tool --rank=mumstrg:all --journal=purge_queue journal inspect
> Overall journal integrity: OK
>
> cephfs-journal-tool --rank=mumstrg:all --journal=mdlog journal inspect
> Overall journal integrity: OK
>
> On Thu, Apr 17, 2025 at 2:59 AM Eugen Block <ebl...@nde.ag> wrote:
>
>> I think either your mdlog or the purge_queue journal is corrupted:
>>
>> 2025-04-16T09:59:30.146+0000 7f43cf872700 2 mds.0.127506 Booting: 2: waiting for purge queue recovered
>> Apr 16 15:29:30 strg-node4 bash[7566]: debug -1> 2025-04-16T09:59:30.146+0000 7f43d9085700 10 monclient: get_auth_request con 0x562856a25400 auth_method 0
>> Apr 16 15:29:30 strg-node4 bash[7566]: debug 0> 2025-04-16T09:59:30.230+0000 7f43ce06f700 -1 *** Caught signal (Segmentation fault) **
>> Apr 16 15:29:30 strg-node4 bash[7566]: in thread 7f43ce06f700 thread_name:md_log_replay
>>
>> Can you paste the output of these commands?
>>
>> cephfs-journal-tool --rank={YOUR_CEPH_FS}:all --journal=purge_queue journal inspect
>> cephfs-journal-tool --rank={YOUR_CEPH_FS}:all --journal=mdlog journal inspect
>>
>> I expect one or more damaged entries. Check this thread for more details:
>>
>> https://www.spinics.net/lists/ceph-users/msg80124.html
>>
>> You should try to back up the journal, but in my case that wasn't possible, so I had no other choice than resetting it.
>>
>> Zitat von Amudhan P <amudha...@gmail.com>:
>>
>> > Hi,
>> >
>> > I am having 2 problems with my Ceph cluster, version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable), deployed through cephadm.
>> >
>> > First issue:
>> > 1 out of 3 mon services went out of quorum.
>> > When the service is restarted it comes back to normal, but after a few minutes the ceph watch log reports slow ops and the mon goes out of quorum again.
>> > The node where this mon service failed showed one weird thing: around 40% iowait in the top command. But I don't see any error in dmesg or anything related to drive I/O errors.
>> > Below are the logs that were printed by the ceph watch command.
>> >
>> > 2025-04-16T09:30:00.000393+0530 mon.strg-node2 [WRN] [WRN] MON_DOWN: 1/3 mons down, quorum strg-node2,strg-node3
>> > 2025-04-16T09:30:00.000416+0530 mon.strg-node2 [WRN] mon.strg-node1 (rank 0) addr [v2:10.0.103.1:3300/0,v1:10.0.103.1:6789/0] is down (out of quorum)
>> >
>> > For now this is not appearing again.
>> >
>> > Second issue, CephFS degraded:
>> > I have 2 MDS services running on 2 different nodes. Both are in a stopped state.
>> > When running the ceph -s command:
>> >
>> >   cluster:
>> >     id:     7b3a4952-2131-11ec-94ce-0cc47a5ec98a
>> >     health: HEALTH_WARN
>> >             2 failed cephadm daemon(s)
>> >             1 filesystem is degraded
>> >             insufficient standby MDS daemons available
>> >
>> >   services:
>> >     mon: 3 daemons, quorum strg-node1,strg-node2,strg-node3 (age 4h)
>> >     mgr: strg-node2.unyimy(active, since 4h), standbys: strg-node1.ivkfid
>> >     mds: 1/1 daemons up
>> >     osd: 32 osds: 32 up (since 4h), 32 in (since 10w)
>> >
>> >   data:
>> >     volumes: 0/1 healthy, 1 recovering
>> >     pools:   3 pools, 321 pgs
>> >     objects: 15.49M objects, 54 TiB
>> >     usage:   109 TiB used, 66 TiB / 175 TiB avail
>> >     pgs:     321 active+clean
>> >
>> > The volume shows "recovering", but there hasn't been any progress so far, and manually starting the MDS service makes it fail again. Under "services" ceph -s shows the MDS as up, but no MDS service is actually running.
>> >
>> > Below is a log snippet from one of the MDS services.
>> >
>> > -25> 2025-04-16T09:59:29.954+0000 7f43d0874700 1 mds.0.journaler.pq(ro) _finish_read_head loghead(trim 13967032320, ex>
>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -24> 2025-04-16T09:59:29.954+0000 7f43d0874700 1 mds.0.journaler.pq(ro) probing for end of the log
>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -23> 2025-04-16T09:59:29.954+0000 7f43d9085700 10 monclient: get_auth_request con 0x562856a17400 auth_method 0
>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -22> 2025-04-16T09:59:29.954+0000 7f43d8884700 10 monclient: get_auth_request con 0x562856a17c00 auth_method 0
>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -21> 2025-04-16T09:59:29.974+0000 7f43cf071700 1 mds.0.journaler.mdlog(ro) recover start
>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -20> 2025-04-16T09:59:29.974+0000 7f43cf071700 1 mds.0.journaler.mdlog(ro) read_head
>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -19> 2025-04-16T09:59:29.974+0000 7f43cf071700 4 mds.0.log Waiting for journal 0x200 to recover...
>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -18> 2025-04-16T09:59:29.974+0000 7f43d8083700 10 monclient: get_auth_request con 0x562856a25000 auth_method 0
>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -17> 2025-04-16T09:59:29.998+0000 7f43d0874700 1 mds.0.journaler.pq(ro) _finish_probe_end write_pos = 13968309289 (hea>
>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -16> 2025-04-16T09:59:29.998+0000 7f43d0874700 4 mds.0.purge_queue operator(): open complete
>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -15> 2025-04-16T09:59:29.998+0000 7f43d0874700 1 mds.0.journaler.pq(ro) set_writeable
>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -14> 2025-04-16T09:59:29.998+0000 7f43cf872700 1 mds.0.journaler.mdlog(ro) _finish_read_head loghead(trim 189741504921>
>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -13> 2025-04-16T09:59:29.998+0000 7f43cf872700 1 mds.0.journaler.mdlog(ro) probing for end of the log
>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -12> 2025-04-16T09:59:30.002+0000 7f43d9085700 10 monclient: get_auth_request con 0x562856a25c00 auth_method 0
>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -11> 2025-04-16T09:59:30.098+0000 7f43cf872700 1 mds.0.journaler.mdlog(ro) _finish_probe_end write_pos = 1897428915052>
>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -10> 2025-04-16T09:59:30.098+0000 7f43cf071700 4 mds.0.log Journal 0x200 recovered.
>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -9> 2025-04-16T09:59:30.098+0000 7f43cf071700 4 mds.0.log Recovered journal 0x200 in format 1
>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -8> 2025-04-16T09:59:30.098+0000 7f43cf071700 2 mds.0.127506 Booting: 1: loading/discovering base inodes
>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -7> 2025-04-16T09:59:30.098+0000 7f43cf071700 0 mds.0.cache creating system inode with ino:0x100
>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -6> 2025-04-16T09:59:30.098+0000 7f43cf071700 0 mds.0.cache creating system inode with ino:0x1
>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -5> 2025-04-16T09:59:30.098+0000 7f43d8884700 10 monclient: get_auth_request con 0x562856a25800 auth_method 0
>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -4> 2025-04-16T09:59:30.098+0000 7f43d8083700 10 monclient: get_auth_request con 0x562856a5dc00 auth_method 0
>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -3> 2025-04-16T09:59:30.146+0000 7f43cf872700 2 mds.0.127506 Booting: 2: replaying mds log
>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -2> 2025-04-16T09:59:30.146+0000 7f43cf872700 2 mds.0.127506 Booting: 2: waiting for purge queue recovered
>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -1> 2025-04-16T09:59:30.146+0000 7f43d9085700 10 monclient: get_auth_request con 0x562856a25400 auth_method 0
>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug 0> 2025-04-16T09:59:30.230+0000 7f43ce06f700 -1 *** Caught signal (Segmentation fault) **
>> > Apr 16 15:29:30 strg-node4 bash[7566]: in thread 7f43ce06f700 thread_name:md_log_replay
>> > Apr 16 15:29:30 strg-node4 bash[7566]: ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
>> > Apr 16 15:29:30 strg-node4 bash[7566]: 1: /lib64/libpthread.so.0(+0x12b20) [0x7f43dd293b20]
>> > Apr 16 15:29:30 strg-node4 bash[7566]: 2: /usr/lib64/ceph/libceph-common.so.2(+0x8ec7a00) [0x7f43e6ef3a00]
>> > Apr 16 15:29:30 strg-node4 bash[7566]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>> >
>> > Not sure what caused the issue. I couldn't find any resources to fix this issue.
>> > Need help from someone to bring the ceph cluster online.
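
About the two highlighted lines above: "waiting for osdmap 45749 (which blocklists prior instance)" means the MDS will not start replay until it has seen an OSDMap at least as new as epoch 45749, and "data pool 3 not found in OSDMap" at this early point most likely just means the MDS has not yet received an OSDMap that contains the pool (its handle_osd_map line above still shows epoch 0). A rough sketch of standard ceph CLI checks to confirm both, assuming nothing beyond the names already in this thread:

  # Current OSDMap epoch, should be >= 45749 for replay to proceed
  ceph osd stat
  ceph osd dump | grep ^epoch

  # Blocklist entries for the prior MDS instance (Pacific syntax)
  ceph osd blocklist ls

  # Confirm that pool id 3 really is the CephFS data pool
  ceph osd lspools
  ceph fs ls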
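
On Eugen's point about backing up the journal before any reset: both journals inspect as OK here, so a reset may not be needed at all, but for reference the usual sequence from the CephFS disaster-recovery documentation looks roughly like the sketch below. The fs name mumstrg and rank 0 are taken from this thread, the backup file paths are only examples, and it is worth re-checking the docs for the exact 16.2.x syntax before running anything destructive:

  # 1. Back up both journals before touching anything
  cephfs-journal-tool --rank=mumstrg:0 journal export /root/mdlog-backup.bin
  cephfs-journal-tool --rank=mumstrg:0 --journal=purge_queue journal export /root/purge_queue-backup.bin

  # 2. Only if the mdlog really is damaged: salvage dentries, then reset
  cephfs-journal-tool --rank=mumstrg:0 event recover_dentries summary
  cephfs-journal-tool --rank=mumstrg:0 journal reset
  cephfs-table-tool mumstrg:all reset session

  # 3. Only if the purge_queue is the damaged one
  cephfs-journal-tool --rank=mumstrg:0 --journal=purge_queue journal reset

Since the crashing thread is md_log_replay, the mdlog would be the first journal to look at if the inspects ever start reporting damage.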
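
For the first issue (the mon dropping out of quorum with slow ops and around 40% iowait on that node), a few generic checks for the next time it happens, run from a cephadm shell on the affected node; nothing here is specific to this cluster beyond the mon name already in the logs:

  # Quorum membership and overall health
  ceph quorum_status
  ceph health detail

  # Slow / in-flight ops on the affected mon via its admin socket
  ceph daemon mon.strg-node1 ops
  ceph daemon mon.strg-node1 mon_status

  # Disk latency on the device holding the mon store (the iowait suspect)
  iostat -x 5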