I think either your mdlog or the purge_queue journal is corrupted:
2025-04-16T09:59:30.146+0000 7f43cf872700 2 mds.0.127506 Booting: 2:
waiting for purge queue recovered
Apr 16 15:29:30 strg-node4 bash[7566]: debug -1>
2025-04-16T09:59:30.146+0000 7f43d9085700 10 monclient: get_auth_request
con 0x562856a25400 auth_method 0
Apr 16 15:29:30 strg-node4 bash[7566]: debug 0>
2025-04-16T09:59:30.230+0000 7f43ce06f700 -1 *** Caught signal
(Segmentation fault) **
Apr 16 15:29:30 strg-node4 bash[7566]: in thread 7f43ce06f700
thread_name:md_log_replay
Can you paste the output of these commands?
cephfs-journal-tool --rank={YOUR_CEPH_FS}:all --journal=purge_queue journal inspect
cephfs-journal-tool --rank={YOUR_CEPH_FS}:all --journal=mdlog journal inspect
I expect one or more damaged entries. Check this thread for more details:
https://www.spinics.net/lists/ceph-users/msg80124.html
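If the inspect output does report damage, it can also help to look at the journal header and an event summary to see where the broken region sits. A rough example (rank 0 here is just an example, adapt it to your filesystem):
cephfs-journal-tool --rank={YOUR_CEPH_FS}:0 --journal=mdlog header get
cephfs-journal-tool --rank={YOUR_CEPH_FS}:0 --journal=mdlog event get summary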
You should try to back up the journal first; in my case that wasn't possible, so I had no choice but to reset it.
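For reference, the usual sequence from the CephFS disaster-recovery docs looks roughly like this (adapt the filesystem name and rank, stop all MDS daemons first, and treat the resets as a last resort because they discard whatever was in the damaged journal):
cephfs-journal-tool --rank={YOUR_CEPH_FS}:0 --journal=mdlog journal export backup.bin
cephfs-journal-tool --rank={YOUR_CEPH_FS}:0 --journal=mdlog event recover_dentries summary
cephfs-journal-tool --rank={YOUR_CEPH_FS}:0 --journal=mdlog journal reset
cephfs-journal-tool --rank={YOUR_CEPH_FS}:0 --journal=purge_queue journal reset   # only if the purge_queue is the damaged one
cephfs-table-tool all reset session
Which of these steps you actually need depends on which journal turns out to be damaged, so please check the thread above and the disaster-recovery documentation before running anything destructive.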
Quoting Amudhan P <amudha...@gmail.com>:
Hi,
I am having two problems with my Ceph cluster, version 16.2.6
(ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable), deployed through cephadm.
First issue:
1 out of 3 mon services went out of quorum.
When I restart the service it comes back to normal, but after a few minutes the
ceph watch log reports slow ops and the mon goes out of quorum again.
The node where this mon service failed showed one weird thing: I could see about
40% I/O wait in the top command. But I don't see any errors in dmesg or anything
related to drive I/O errors.
Below are the logs printed by the ceph watch command.
2025-04-16T09:30:00.000393+0530 mon.strg-node2 [WRN] [WRN] MON_DOWN: 1/3
mons down, quorum strg-node2,strg-node3
2025-04-16T09:30:00.000416+0530 mon.strg-node2 [WRN] mon.strg-node1
(rank 0) addr [v2:10.0.103.1:3300/0,v1:10.0.103.1:6789/0] is down (out of
quorum)
For now, this is not appearing again.
Second issue, CephFS degraded:
I have 2 MDS services deployed on 2 different nodes. Both are in a stopped state.
Output of the ceph -s command:
  cluster:
    id:     7b3a4952-2131-11ec-94ce-0cc47a5ec98a
    health: HEALTH_WARN
            2 failed cephadm daemon(s)
            1 filesystem is degraded
            insufficient standby MDS daemons available

  services:
    mon: 3 daemons, quorum strg-node1,strg-node2,strg-node3 (age 4h)
    mgr: strg-node2.unyimy(active, since 4h), standbys: strg-node1.ivkfid
    mds: 1/1 daemons up
    osd: 32 osds: 32 up (since 4h), 32 in (since 10w)

  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   3 pools, 321 pgs
    objects: 15.49M objects, 54 TiB
    usage:   109 TiB used, 66 TiB / 175 TiB avail
    pgs:     321 active+clean
The volume shows as recovering, but there has been no progress so far, and
manually starting the MDS service fails again. In the ceph -s output under
services it shows the mds as up, but no MDS service is actually running.
Below is a log snippet from one of the MDS services.
-25> 2025-04-16T09:59:29.954+0000 7f43d0874700 1
mds.0.journaler.pq(ro) _finish_read_head loghead(trim 13967032320, ex>
Apr 16 15:29:30 strg-node4 bash[7566]: debug -24>
2025-04-16T09:59:29.954+0000 7f43d0874700 1 mds.0.journaler.pq(ro) probing
for end of the log
Apr 16 15:29:30 strg-node4 bash[7566]: debug -23>
2025-04-16T09:59:29.954+0000 7f43d9085700 10 monclient: get_auth_request
con 0x562856a17400 auth_method 0
Apr 16 15:29:30 strg-node4 bash[7566]: debug -22>
2025-04-16T09:59:29.954+0000 7f43d8884700 10 monclient: get_auth_request
con 0x562856a17c00 auth_method 0
Apr 16 15:29:30 strg-node4 bash[7566]: debug -21>
2025-04-16T09:59:29.974+0000 7f43cf071700 1 mds.0.journaler.mdlog(ro)
recover start
Apr 16 15:29:30 strg-node4 bash[7566]: debug -20>
2025-04-16T09:59:29.974+0000 7f43cf071700 1 mds.0.journaler.mdlog(ro)
read_head
Apr 16 15:29:30 strg-node4 bash[7566]: debug -19>
2025-04-16T09:59:29.974+0000 7f43cf071700 4 mds.0.log Waiting for journal
0x200 to recover...
Apr 16 15:29:30 strg-node4 bash[7566]: debug -18>
2025-04-16T09:59:29.974+0000 7f43d8083700 10 monclient: get_auth_request
con 0x562856a25000 auth_method 0
Apr 16 15:29:30 strg-node4 bash[7566]: debug -17>
2025-04-16T09:59:29.998+0000 7f43d0874700 1 mds.0.journaler.pq(ro)
_finish_probe_end write_pos = 13968309289 (hea>
Apr 16 15:29:30 strg-node4 bash[7566]: debug -16>
2025-04-16T09:59:29.998+0000 7f43d0874700 4 mds.0.purge_queue operator():
open complete
Apr 16 15:29:30 strg-node4 bash[7566]: debug -15>
2025-04-16T09:59:29.998+0000 7f43d0874700 1 mds.0.journaler.pq(ro)
set_writeable
Apr 16 15:29:30 strg-node4 bash[7566]: debug -14>
2025-04-16T09:59:29.998+0000 7f43cf872700 1 mds.0.journaler.mdlog(ro)
_finish_read_head loghead(trim 189741504921>
Apr 16 15:29:30 strg-node4 bash[7566]: debug -13>
2025-04-16T09:59:29.998+0000 7f43cf872700 1 mds.0.journaler.mdlog(ro)
probing for end of the log
Apr 16 15:29:30 strg-node4 bash[7566]: debug -12>
2025-04-16T09:59:30.002+0000 7f43d9085700 10 monclient: get_auth_request
con 0x562856a25c00 auth_method 0
Apr 16 15:29:30 strg-node4 bash[7566]: debug -11>
2025-04-16T09:59:30.098+0000 7f43cf872700 1 mds.0.journaler.mdlog(ro)
_finish_probe_end write_pos = 1897428915052>
Apr 16 15:29:30 strg-node4 bash[7566]: debug -10>
2025-04-16T09:59:30.098+0000 7f43cf071700 4 mds.0.log Journal 0x200
recovered.
Apr 16 15:29:30 strg-node4 bash[7566]: debug -9>
2025-04-16T09:59:30.098+0000 7f43cf071700 4 mds.0.log Recovered journal
0x200 in format 1
Apr 16 15:29:30 strg-node4 bash[7566]: debug -8>
2025-04-16T09:59:30.098+0000 7f43cf071700 2 mds.0.127506 Booting: 1:
loading/discovering base inodes
Apr 16 15:29:30 strg-node4 bash[7566]: debug -7>
2025-04-16T09:59:30.098+0000 7f43cf071700 0 mds.0.cache creating system
inode with ino:0x100
Apr 16 15:29:30 strg-node4 bash[7566]: debug -6>
2025-04-16T09:59:30.098+0000 7f43cf071700 0 mds.0.cache creating system
inode with ino:0x1
Apr 16 15:29:30 strg-node4 bash[7566]: debug -5>
2025-04-16T09:59:30.098+0000 7f43d8884700 10 monclient: get_auth_request
con 0x562856a25800 auth_method 0
Apr 16 15:29:30 strg-node4 bash[7566]: debug -4>
2025-04-16T09:59:30.098+0000 7f43d8083700 10 monclient: get_auth_request
con 0x562856a5dc00 auth_method 0
Apr 16 15:29:30 strg-node4 bash[7566]: debug -3>
2025-04-16T09:59:30.146+0000 7f43cf872700 2 mds.0.127506 Booting: 2:
replaying mds log
Apr 16 15:29:30 strg-node4 bash[7566]: debug -2>
2025-04-16T09:59:30.146+0000 7f43cf872700 2 mds.0.127506 Booting: 2:
waiting for purge queue recovered
Apr 16 15:29:30 strg-node4 bash[7566]: debug -1>
2025-04-16T09:59:30.146+0000 7f43d9085700 10 monclient: get_auth_request
con 0x562856a25400 auth_method 0
Apr 16 15:29:30 strg-node4 bash[7566]: debug 0>
2025-04-16T09:59:30.230+0000 7f43ce06f700 -1 *** Caught signal
(Segmentation fault) **
Apr 16 15:29:30 strg-node4 bash[7566]: in thread 7f43ce06f700
thread_name:md_log_replay
Apr 16 15:29:30 strg-node4 bash[7566]: ceph version 16.2.6
(ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
Apr 16 15:29:30 strg-node4 bash[7566]: 1: /lib64/libpthread.so.0(+0x12b20)
[0x7f43dd293b20]
Apr 16 15:29:30 strg-node4 bash[7566]: 2:
/usr/lib64/ceph/libceph-common.so.2(+0x8ec7a00) [0x7f43e6ef3a00]
Apr 16 15:29:30 strg-node4 bash[7566]: NOTE: a copy of the executable, or
`objdump -rdS <executable>` is needed to interpret this.
I am not sure what caused the issue, and I couldn't find any resources to fix it.
I need help from someone to bring the Ceph cluster back online.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io