Eugen,

This is the output of the two commands:

cephfs-journal-tool --rank=mumstrg:all --journal=purge_queue journal inspect
Overall journal integrity: OK

cephfs-journal-tool --rank=mumstrg:all --journal=mdlog journal inspect
Overall journal integrity: OK
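Following your suggestion I would like to back up both journals before touching anything else. Is this the right way to do it? (Untested on my side, only a sketch based on the cephfs-journal-tool documentation, assuming the filesystem name mumstrg and the single active rank 0; the output file paths are just placeholders.)

cephfs-journal-tool --rank=mumstrg:0 journal export /root/mdlog-backup.bin
cephfs-journal-tool --rank=mumstrg:0 --journal=purge_queue journal export /root/purge_queue-backup.bin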
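Also, if the mdlog still cannot be replayed and I end up having to reset it like you did, is the sequence below the right one to follow? It is only a sketch of the documented CephFS disaster-recovery steps, to be run with all MDS daemons stopped (and the purge_queue reset only if that journal turns out to be the damaged one).

cephfs-journal-tool --rank=mumstrg:0 event recover_dentries summary
cephfs-journal-tool --rank=mumstrg:0 --journal=purge_queue journal reset
cephfs-journal-tool --rank=mumstrg:0 journal reset
cephfs-table-tool mumstrg:all reset session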
On Thu, Apr 17, 2025 at 2:59 AM Eugen Block <ebl...@nde.ag> wrote:

> I think either your mdlog or the purge_queue journal is corrupted:
>
> 2025-04-16T09:59:30.146+0000 7f43cf872700 2 mds.0.127506 Booting: 2: waiting for purge queue recovered
> Apr 16 15:29:30 strg-node4 bash[7566]: debug -1> 2025-04-16T09:59:30.146+0000 7f43d9085700 10 monclient: get_auth_request con 0x562856a25400 auth_method 0
> Apr 16 15:29:30 strg-node4 bash[7566]: debug 0> 2025-04-16T09:59:30.230+0000 7f43ce06f700 -1 *** Caught signal (Segmentation fault) **
> Apr 16 15:29:30 strg-node4 bash[7566]: in thread 7f43ce06f700 thread_name:md_log_replay
>
> Can you paste the output of these commands?
>
> cephfs-journal-tool --rank={YOUR_CEPH_FS}:all --journal=purge_queue journal inspect
> cephfs-journal-tool --rank={YOUR_CEPH_FS}:all --journal=mdlog journal inspect
>
> I expect one or more damaged entries. Check this thread for more details:
>
> https://www.spinics.net/lists/ceph-users/msg80124.html
>
> You should try to back up the journal, but in my case that wasn't possible, so I had no other choice than resetting it.
>
> Zitat von Amudhan P <amudha...@gmail.com>:
>
> > Hi,
> >
> > I am having two problems with my Ceph cluster, version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable), deployed through cephadm.
> >
> > First issue:
> > One of the three mon services went out of quorum. When the service is restarted it comes back to normal, but after a few minutes the ceph watch log reports slow ops and the mon goes out of quorum again.
> > The node where this mon failed showed one odd thing: around 40% wait (iowait) in top. But I don't see any errors in dmesg or anything related to drive I/O.
> > Below are the lines printed by the ceph watch command:
> >
> > 2025-04-16T09:30:00.000393+0530 mon.strg-node2 [WRN] [WRN] MON_DOWN: 1/3 mons down, quorum strg-node2,strg-node3
> > 2025-04-16T09:30:00.000416+0530 mon.strg-node2 [WRN]     mon.strg-node1 (rank 0) addr [v2:10.0.103.1:3300/0,v1:10.0.103.1:6789/0] is down (out of quorum)
> >
> > For now this is not appearing again.
> >
> > Second issue, CephFS degraded:
> > I have two MDS services running on two different nodes. Both are in a stopped state.
> > Output of ceph -s:
> >
> >   cluster:
> >     id:     7b3a4952-2131-11ec-94ce-0cc47a5ec98a
> >     health: HEALTH_WARN
> >             2 failed cephadm daemon(s)
> >             1 filesystem is degraded
> >             insufficient standby MDS daemons available
> >
> >   services:
> >     mon: 3 daemons, quorum strg-node1,strg-node2,strg-node3 (age 4h)
> >     mgr: strg-node2.unyimy(active, since 4h), standbys: strg-node1.ivkfid
> >     mds: 1/1 daemons up
> >     osd: 32 osds: 32 up (since 4h), 32 in (since 10w)
> >
> >   data:
> >     volumes: 0/1 healthy, 1 recovering
> >     pools:   3 pools, 321 pgs
> >     objects: 15.49M objects, 54 TiB
> >     usage:   109 TiB used, 66 TiB / 175 TiB avail
> >     pgs:     321 active+clean
> >
> > The volume shows "recovering", but there has been no progress so far, and manually starting the MDS service makes it fail again. Under "services" ceph -s shows an MDS as up, but no MDS service is actually running.
> >
> > Below is a log snip from one of the MDS services.
> >      -25> 2025-04-16T09:59:29.954+0000 7f43d0874700 1 mds.0.journaler.pq(ro) _finish_read_head loghead(trim 13967032320, ex>
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -24> 2025-04-16T09:59:29.954+0000 7f43d0874700 1 mds.0.journaler.pq(ro) probing for end of the log
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -23> 2025-04-16T09:59:29.954+0000 7f43d9085700 10 monclient: get_auth_request con 0x562856a17400 auth_method 0
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -22> 2025-04-16T09:59:29.954+0000 7f43d8884700 10 monclient: get_auth_request con 0x562856a17c00 auth_method 0
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -21> 2025-04-16T09:59:29.974+0000 7f43cf071700 1 mds.0.journaler.mdlog(ro) recover start
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -20> 2025-04-16T09:59:29.974+0000 7f43cf071700 1 mds.0.journaler.mdlog(ro) read_head
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -19> 2025-04-16T09:59:29.974+0000 7f43cf071700 4 mds.0.log Waiting for journal 0x200 to recover...
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -18> 2025-04-16T09:59:29.974+0000 7f43d8083700 10 monclient: get_auth_request con 0x562856a25000 auth_method 0
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -17> 2025-04-16T09:59:29.998+0000 7f43d0874700 1 mds.0.journaler.pq(ro) _finish_probe_end write_pos = 13968309289 (hea>
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -16> 2025-04-16T09:59:29.998+0000 7f43d0874700 4 mds.0.purge_queue operator(): open complete
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -15> 2025-04-16T09:59:29.998+0000 7f43d0874700 1 mds.0.journaler.pq(ro) set_writeable
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -14> 2025-04-16T09:59:29.998+0000 7f43cf872700 1 mds.0.journaler.mdlog(ro) _finish_read_head loghead(trim 189741504921>
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -13> 2025-04-16T09:59:29.998+0000 7f43cf872700 1 mds.0.journaler.mdlog(ro) probing for end of the log
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -12> 2025-04-16T09:59:30.002+0000 7f43d9085700 10 monclient: get_auth_request con 0x562856a25c00 auth_method 0
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -11> 2025-04-16T09:59:30.098+0000 7f43cf872700 1 mds.0.journaler.mdlog(ro) _finish_probe_end write_pos = 1897428915052>
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -10> 2025-04-16T09:59:30.098+0000 7f43cf071700 4 mds.0.log Journal 0x200 recovered.
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -9> 2025-04-16T09:59:30.098+0000 7f43cf071700 4 mds.0.log Recovered journal 0x200 in format 1
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -8> 2025-04-16T09:59:30.098+0000 7f43cf071700 2 mds.0.127506 Booting: 1: loading/discovering base inodes
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -7> 2025-04-16T09:59:30.098+0000 7f43cf071700 0 mds.0.cache creating system inode with ino:0x100
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -6> 2025-04-16T09:59:30.098+0000 7f43cf071700 0 mds.0.cache creating system inode with ino:0x1
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -5> 2025-04-16T09:59:30.098+0000 7f43d8884700 10 monclient: get_auth_request con 0x562856a25800 auth_method 0
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -4> 2025-04-16T09:59:30.098+0000 7f43d8083700 10 monclient: get_auth_request con 0x562856a5dc00 auth_method 0
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -3> 2025-04-16T09:59:30.146+0000 7f43cf872700 2 mds.0.127506 Booting: 2: replaying mds log
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -2> 2025-04-16T09:59:30.146+0000 7f43cf872700 2 mds.0.127506 Booting: 2: waiting for purge queue recovered
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -1> 2025-04-16T09:59:30.146+0000 7f43d9085700 10 monclient: get_auth_request con 0x562856a25400 auth_method 0
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug 0> 2025-04-16T09:59:30.230+0000 7f43ce06f700 -1 *** Caught signal (Segmentation fault) **
> > Apr 16 15:29:30 strg-node4 bash[7566]: in thread 7f43ce06f700 thread_name:md_log_replay
> > Apr 16 15:29:30 strg-node4 bash[7566]: ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
> > Apr 16 15:29:30 strg-node4 bash[7566]: 1: /lib64/libpthread.so.0(+0x12b20) [0x7f43dd293b20]
> > Apr 16 15:29:30 strg-node4 bash[7566]: 2: /usr/lib64/ceph/libceph-common.so.2(+0x8ec7a00) [0x7f43e6ef3a00]
> > Apr 16 15:29:30 strg-node4 bash[7566]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> >
> > I am not sure what caused this and couldn't find any resources to fix it. I need help from someone to bring the Ceph cluster back online.

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io