Eugen,

This is the output of the commands:
cephfs-journal-tool --rank=mumstrg:all --journal=purge_queue journal inspect
Overall journal integrity: OK
cephfs-journal-tool --rank=mumstrg:all --journal=mdlog journal inspect
Overall journal integrity: OK
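
If I understand the backup step you mentioned correctly, it would be
something along these lines (assuming rank 0; the file names are just
placeholders on my side):

cephfs-journal-tool --rank=mumstrg:0 --journal=mdlog journal export /root/mdlog-backup.bin
cephfs-journal-tool --rank=mumstrg:0 --journal=purge_queue journal export /root/pq-backup.bin

Please correct me if that is not what you meant.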

On Thu, Apr 17, 2025 at 2:59 AM Eugen Block <ebl...@nde.ag> wrote:

> I think either your mdlog or the purge_queue journal is corrupted:
>
> 2025-04-16T09:59:30.146+0000 7f43cf872700  2 mds.0.127506 Booting: 2:
> waiting for purge queue recovered
> Apr 16 15:29:30 strg-node4 bash[7566]: debug     -1>
> 2025-04-16T09:59:30.146+0000 7f43d9085700 10 monclient: get_auth_request
> con 0x562856a25400 auth_method 0
> Apr 16 15:29:30 strg-node4 bash[7566]: debug      0>
> 2025-04-16T09:59:30.230+0000 7f43ce06f700 -1 *** Caught signal
> (Segmentation fault) **
> Apr 16 15:29:30 strg-node4 bash[7566]:  in thread 7f43ce06f700
> thread_name:md_log_replay
>
> Can you paste the output of this command?
>
> cephfs-journal-tool --rank={YOUR_CEPH_FS}:all --journal=purge_queue journal inspect
> cephfs-journal-tool --rank={YOUR_CEPH_FS}:all --journal=mdlog journal inspect
>
> I expect one or more damaged entries. Check this thread for more details:
>
> https://www.spinics.net/lists/ceph-users/msg80124.html
>
> You should try to back up the journal first, but in my case that wasn't
> possible, so I had no choice but to reset it.
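>
> To be explicit, the reset I am referring to is roughly this sequence (the
> fs name and rank are placeholders, and only after attempting a backup):
>
> cephfs-journal-tool --rank={YOUR_CEPH_FS}:0 event recover_dentries summary
> cephfs-journal-tool --rank={YOUR_CEPH_FS}:0 journal reset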
>
> Quoting Amudhan P <amudha...@gmail.com>:
>
> > Hi,
> >
> > I am having 2 problems with my Ceph cluster, version 16.2.6
> > (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable), deployed
> > through cephadm.
> >
> > First issue:
> > 1 out of 3 mon services went out of quorum.
> > When the service is restarted it comes back to normal, but after a few
> > minutes the ceph watch log reports slow ops and the mon goes out of
> > quorum again.
> > One weird thing on the node where this mon service failed: I could see
> > 40% wait (iowait) in the top command, but I don't see any errors in
> > dmesg or anything related to drive I/O.
> > Below are the logs printed by the ceph watch command:
> >
> > 2025-04-16T09:30:00.000393+0530 mon.strg-node2 [WRN] [WRN] MON_DOWN: 1/3
> > mons down, quorum strg-node2,strg-node3
> > 2025-04-16T09:30:00.000416+0530 mon.strg-node2 [WRN]     mon.strg-node1
> > (rank 0) addr [v2:10.0.103.1:3300/0,v1:10.0.103.1:6789/0] is down (out of
> > quorum)
> >
> > For now this has not appeared again.
> >
> >
> > Second issue, CephFS degraded:
> > I have 2 MDS services deployed on 2 different nodes. Both are in a
> > stopped state.
> > Output of the ceph -s command:
> >
> >   cluster:
> >     id:     7b3a4952-2131-11ec-94ce-0cc47a5ec98a
> >     health: HEALTH_WARN
> >             2 failed cephadm daemon(s)
> >             1 filesystem is degraded
> >             insufficient standby MDS daemons available
> >
> >   services:
> >     mon: 3 daemons, quorum strg-node1,strg-node2,strg-node3 (age 4h)
> >     mgr: strg-node2.unyimy(active, since 4h), standbys: strg-node1.ivkfid
> >     mds: 1/1 daemons up
> >     osd: 32 osds: 32 up (since 4h), 32 in (since 10w)
> >
> >   data:
> >     volumes: 0/1 healthy, 1 recovering
> >     pools:   3 pools, 321 pgs
> >     objects: 15.49M objects, 54 TiB
> >     usage:   109 TiB used, 66 TiB / 175 TiB avail
> >     pgs:     321 active+clean
> >
> > The volume shows recovering, but there has been no progress so far, and
> > manually starting the mds service fails again. In the ceph -s output,
> > under services it shows the mds as up, but no mds service is actually
> > running.
> >
> > Below is a log snippet from one of the mds services.
> >
> >
> >             -25> 2025-04-16T09:59:29.954+0000 7f43d0874700  1
> > mds.0.journaler.pq(ro) _finish_read_head loghead(trim 13967032320, ex>
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -24>
> > 2025-04-16T09:59:29.954+0000 7f43d0874700  1 mds.0.journaler.pq(ro)
> probing
> > for end of the log
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -23>
> > 2025-04-16T09:59:29.954+0000 7f43d9085700 10 monclient: get_auth_request
> > con 0x562856a17400 auth_method 0
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -22>
> > 2025-04-16T09:59:29.954+0000 7f43d8884700 10 monclient: get_auth_request
> > con 0x562856a17c00 auth_method 0
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -21>
> > 2025-04-16T09:59:29.974+0000 7f43cf071700  1 mds.0.journaler.mdlog(ro)
> > recover start
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -20>
> > 2025-04-16T09:59:29.974+0000 7f43cf071700  1 mds.0.journaler.mdlog(ro)
> > read_head
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -19>
> > 2025-04-16T09:59:29.974+0000 7f43cf071700  4 mds.0.log Waiting for
> journal
> > 0x200 to recover...
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -18>
> > 2025-04-16T09:59:29.974+0000 7f43d8083700 10 monclient: get_auth_request
> > con 0x562856a25000 auth_method 0
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -17>
> > 2025-04-16T09:59:29.998+0000 7f43d0874700  1 mds.0.journaler.pq(ro)
> > _finish_probe_end write_pos = 13968309289 (hea>
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -16>
> > 2025-04-16T09:59:29.998+0000 7f43d0874700  4 mds.0.purge_queue
> operator():
> > open complete
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -15>
> > 2025-04-16T09:59:29.998+0000 7f43d0874700  1 mds.0.journaler.pq(ro)
> > set_writeable
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -14>
> > 2025-04-16T09:59:29.998+0000 7f43cf872700  1 mds.0.journaler.mdlog(ro)
> > _finish_read_head loghead(trim 189741504921>
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -13>
> > 2025-04-16T09:59:29.998+0000 7f43cf872700  1 mds.0.journaler.mdlog(ro)
> > probing for end of the log
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -12>
> > 2025-04-16T09:59:30.002+0000 7f43d9085700 10 monclient: get_auth_request
> > con 0x562856a25c00 auth_method 0
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -11>
> > 2025-04-16T09:59:30.098+0000 7f43cf872700  1 mds.0.journaler.mdlog(ro)
> > _finish_probe_end write_pos = 1897428915052>
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -10>
> > 2025-04-16T09:59:30.098+0000 7f43cf071700  4 mds.0.log Journal 0x200
> > recovered.
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug     -9>
> > 2025-04-16T09:59:30.098+0000 7f43cf071700  4 mds.0.log Recovered journal
> > 0x200 in format 1
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug     -8>
> > 2025-04-16T09:59:30.098+0000 7f43cf071700  2 mds.0.127506 Booting: 1:
> > loading/discovering base inodes
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug     -7>
> > 2025-04-16T09:59:30.098+0000 7f43cf071700  0 mds.0.cache creating system
> > inode with ino:0x100
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug     -6>
> > 2025-04-16T09:59:30.098+0000 7f43cf071700  0 mds.0.cache creating system
> > inode with ino:0x1
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug     -5>
> > 2025-04-16T09:59:30.098+0000 7f43d8884700 10 monclient: get_auth_request
> > con 0x562856a25800 auth_method 0
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug     -4>
> > 2025-04-16T09:59:30.098+0000 7f43d8083700 10 monclient: get_auth_request
> > con 0x562856a5dc00 auth_method 0
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug     -3>
> > 2025-04-16T09:59:30.146+0000 7f43cf872700  2 mds.0.127506 Booting: 2:
> > replaying mds log
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug     -2>
> > 2025-04-16T09:59:30.146+0000 7f43cf872700  2 mds.0.127506 Booting: 2:
> > waiting for purge queue recovered
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug     -1>
> > 2025-04-16T09:59:30.146+0000 7f43d9085700 10 monclient: get_auth_request
> > con 0x562856a25400 auth_method 0
> > Apr 16 15:29:30 strg-node4 bash[7566]: debug      0>
> > 2025-04-16T09:59:30.230+0000 7f43ce06f700 -1 *** Caught signal
> > (Segmentation fault) **
> > Apr 16 15:29:30 strg-node4 bash[7566]:  in thread 7f43ce06f700
> > thread_name:md_log_replay
> > Apr 16 15:29:30 strg-node4 bash[7566]:  ceph version 16.2.6
> > (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
> > Apr 16 15:29:30 strg-node4 bash[7566]:  1:
> /lib64/libpthread.so.0(+0x12b20)
> > [0x7f43dd293b20]
> > Apr 16 15:29:30 strg-node4 bash[7566]:  2:
> > /usr/lib64/ceph/libceph-common.so.2(+0x8ec7a00) [0x7f43e6ef3a00]
> > Apr 16 15:29:30 strg-node4 bash[7566]:  NOTE: a copy of the executable,
> or
> > `objdump -rdS <executable>` is needed to interpret this.
> >
> >
> > I'm not sure what caused the issue, and I couldn't find any resources to
> > fix it.
> > I need help from someone to bring the Ceph cluster back online.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
