Okay, thanks Eugen.

On Thu, Apr 17, 2025 at 6:49 PM Eugen Block <ebl...@nde.ag> wrote:
> Those tools are specifically for disaster recovery, meaning there's
> no MDS running (otherwise you wouldn't need them ;-) ).
>
> >> What would this command do ?
> >> ceph fs reset-mds filesystem --yes-i-really-mean-it
>
> I can only quote the docs, I don't know what exactly happens in the
> background, I guess the devs would need to answer that. But maybe you
> don't even need it; sometimes it's sufficient to reset the journal.
>
> A few months ago I asked to improve the disaster recovery docs, since
> it's not really clear what exactly the described steps do and how
> dangerous each operation is. Zac created a tracker for this, but I
> haven't checked in a while for any progress. But most of the time (at
> least that's my perception) the recovery seems to work.
>
> Zitat von Amudhan P <amudha...@gmail.com>:
>
> > Also, I don't have any active MDS running; I'm not sure if I can do
> > all the recovery steps.
> >
> > On Thu, Apr 17, 2025 at 5:42 PM Amudhan P <amudha...@gmail.com> wrote:
> >
> >> I didn't face any network-related issues, but as I mentioned at the
> >> start of this thread, one of the mon nodes was out of quorum.
> >> Even after a restart of the mon node there were issues in the log,
> >> like slow ops on the mon; eventually that cleared, but the same mon
> >> node kept going out of quorum and coming back online. During this
> >> issue I observed 40% CPU wait in the top command on the same node
> >> where the mon had the problem, but there are no errors logged in
> >> dmesg related to the drives or the network. Only after this mon
> >> error, maybe 24 hours later, the MDS failed. The mon issue is
> >> cleared now, after all nodes were reboot-cycled.
> >>
> >> Yeah, disaster recovery will be performed only as a last resort.
> >>
> >> What would this command do ?
> >>
> >> ceph fs reset-mds filesystem --yes-i-really-mean-it
> >>
> >> On Thu, Apr 17, 2025 at 5:19 PM Eugen Block <ebl...@nde.ag> wrote:
> >>
> >>> Was there any issue in your network or anything? Something that
> >>> would explain the MDS crash? I would recommend scanning syslog,
> >>> dmesg etc. for anything suspicious. If you don't find anything, you
> >>> might need to go through
> >>> https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/
> >>> to get your CephFS back up.
> >>>
> >>> Don't forget to create a backup of the journals. Read the
> >>> instructions carefully, you might not need to do all the mentioned
> >>> steps. I would start with (for both journals where applicable):
> >>>
> >>> cephfs-journal-tool journal export backup.bin
> >>> cephfs-journal-tool event recover_dentries summary
> >>> cephfs-journal-tool [--rank=<fs_name>:{mds-rank|all}] journal reset --yes-i-really-really-mean-it
> >>> cephfs-table-tool all reset session
> >>>
> >>> But again, read the instructions carefully. This procedure is
> >>> usually the last resort; I would first try to find out what
> >>> happened to cause this situation.
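> >>>
> >>> For your cluster that would look roughly like this -- a sketch
> >>> only, assuming a single active rank (0) and the fs name "mumstrg"
> >>> from your status output; the backup file paths are just examples:
> >>>
> >>> # 1. back up both journals before touching anything
> >>> cephfs-journal-tool --rank=mumstrg:0 journal export /root/mdlog-backup.bin
> >>> cephfs-journal-tool --rank=mumstrg:0 --journal=purge_queue journal export /root/pq-backup.bin
> >>>
> >>> # 2. write the recoverable journal events back into the metadata pool
> >>> cephfs-journal-tool --rank=mumstrg:0 event recover_dentries summary
> >>>
> >>> # 3. only if replay still fails: reset the journal and the session table
> >>> cephfs-journal-tool --rank=mumstrg:0 journal reset
> >>> cephfs-table-tool all reset session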
> >>>
> >>> Zitat von Amudhan P <amudha...@gmail.com>:
> >>>
> >>> > I don't think I have a memory issue.
> >>> > Sorry for sending the log as a file; Pastebin is not working due
> >>> > to the size limitation.
> >>> >
> >>> > Crash info:
> >>> >
> >>> > ceph crash info 2025-04-17T08:50:35.931485Z_0109f173-206c-471e-afac-c0d3e0aa2785
> >>> > {
> >>> >     "backtrace": [
> >>> >         "/lib64/libpthread.so.0(+0x12b20) [0x7fbd42bbdb20]",
> >>> >         "/usr/lib64/ceph/libceph-common.so.2(+0x8ec7a00) [0x7fbd4c81da00]"
> >>> >     ],
> >>> >     "ceph_version": "16.2.6",
> >>> >     "crash_id": "2025-04-17T08:50:35.931485Z_0109f173-206c-471e-afac-c0d3e0aa2785",
> >>> >     "entity_name": "mds.mummasstrg.strg-node1.gchapr",
> >>> >     "os_id": "centos",
> >>> >     "os_name": "CentOS Linux",
> >>> >     "os_version": "8",
> >>> >     "os_version_id": "8",
> >>> >     "process_name": "ceph-mds",
> >>> >     "stack_sig": "5238fe1be0b82b479b10ee7d17b5ad3182cdc93ede581af63d627a472a4fcf9e",
> >>> >     "timestamp": "2025-04-17T08:50:35.931485Z",
> >>> >     "utsname_hostname": "strg-node1",
> >>> >     "utsname_machine": "x86_64",
> >>> >     "utsname_release": "5.10.0-8-amd64",
> >>> >     "utsname_sysname": "Linux",
> >>> >     "utsname_version": "#1 SMP Debian 5.10.46-5 (2021-09-23)"
> >>> > }
> >>> >
> >>> > On Thu, Apr 17, 2025 at 2:55 PM Eugen Block <ebl...@nde.ag> wrote:
> >>> >
> >>> >> Oh right, I just noticed you had the status in your first
> >>> >> message. Is there any chance the MDS is out of memory? Can you
> >>> >> paste a complete (fresh) startup log on some pastebin or so?
> >>> >> Also, a complete 'ceph crash info <crash>' could be useful.
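> >>> >>
> >>> >> If it helps, all recent crashes can be listed and any of them
> >>> >> dumped with the crash module; <crash_id> below is a placeholder
> >>> >> for one of the listed IDs:
> >>> >>
> >>> >> ceph crash ls
> >>> >> ceph crash info <crash_id>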
> >>> >>
> >>> >> Zitat von Amudhan P <amudha...@gmail.com>:
> >>> >>
> >>> >> > I have run the status and stat commands; below is the output.
> >>> >> >
> >>> >> > ceph -s
> >>> >> >
> >>> >> >   cluster:
> >>> >> >     id:     7b3a4952-2131-11ec-94ce-0cc47a5ec98a
> >>> >> >     health: HEALTH_WARN
> >>> >> >             2 failed cephadm daemon(s)
> >>> >> >             1 filesystem is degraded
> >>> >> >             insufficient standby MDS daemons available
> >>> >> >             7 daemons have recently crashed
> >>> >> >
> >>> >> >   services:
> >>> >> >     mon: 3 daemons, quorum strg-node1,strg-node2,strg-node3 (age 20h)
> >>> >> >     mgr: strg-node2.unyimy(active, since 20h), standbys: strg-node1.ivkfid
> >>> >> >     mds: 1/1 daemons up
> >>> >> >     osd: 32 osds: 32 up (since 20h), 32 in (since 10w)
> >>> >> >
> >>> >> >   data:
> >>> >> >     volumes: 0/1 healthy, 1 recovering
> >>> >> >     pools:   3 pools, 321 pgs
> >>> >> >     objects: 15.49M objects, 54 TiB
> >>> >> >     usage:   109 TiB used, 66 TiB / 175 TiB avail
> >>> >> >     pgs:     317 active+clean
> >>> >> >              4   active+clean+scrubbing+deep
> >>> >> >
> >>> >> > ceph mds stat
> >>> >> > mumstrg:1/1 {0=mumstrg.strg-node1.gchapr=up:replay(laggy or crashed)}
> >>> >> >
> >>> >> > ceph osd lspools
> >>> >> > 1 device_health_metrics
> >>> >> > 2 cephfs.mumstrg.meta
> >>> >> > 3 cephfs.mumstrg.data
> >>> >> >
> >>> >> > On Thu, Apr 17, 2025 at 10:33 AM Eugen Block <ebl...@nde.ag> wrote:
> >>> >> >
> >>> >> >> What's your overall Ceph status? It says data pool 3 not found.
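> >>> >> >>
> >>> >> >> You could verify whether that data pool actually exists in
> >>> >> >> the OSDMap, e.g. with:
> >>> >> >>
> >>> >> >> ceph osd lspools
> >>> >> >> ceph osd pool ls detail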
> >>> >> >>
> >>> >> >> Zitat von Amudhan P <amudha...@gmail.com>:
> >>> >> >>
> >>> >> >> > There are a few more logs from the MDS. I have highlighted
> >>> >> >> > the log lines which I am not sure about.
> >>> >> >> >
> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug    -79> 2025-04-16T14:43:59.170+0000 7f74c6ecd780  5 asok(0x560a2c44e000) register_command dump inode hook 0x560a2c354580
> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug    -78> 2025-04-16T14:43:59.170+0000 7f74c6ecd780  5 asok(0x560a2c44e000) register_command exit hook 0x560a2c354580
> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug    -77> 2025-04-16T14:43:59.170+0000 7f74c6ecd780  5 asok(0x560a2c44e000) register_command respawn hook 0x560a2c354580
> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug    -76> 2025-04-16T14:43:59.170+0000 7f74c6ecd780  5 asok(0x560a2c44e000) register_command heap hook 0x560a2c354580
> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug    -75> 2025-04-16T14:43:59.170+0000 7f74b5030700  1 mds.mumstrg.strg-node3.xhxbwx Updating MDS map to version 127517 from mon.2
> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug    -74> 2025-04-16T14:43:59.170+0000 7f74c6ecd780  5 asok(0x560a2c44e000) register_command cpu_profiler hook 0x560a2c35458
> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug    -73> 2025-04-16T14:43:59.170+0000 7f74b302c700  5 mds.beacon.mumstrg.strg-node3.xhxbwx Sending beacon up:boot seq 1
> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug    -72> 2025-04-16T14:43:59.170+0000 7f74b302c700 10 monclient: _send_mon_message to mon.strg-node3 at v2:10.0.103.3:3300/
> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug    -71> 2025-04-16T14:43:59.254+0000 7f74b5030700  1 mds.mumstrg.strg-node3.xhxbwx Updating MDS map to version 127518 from mon.2
> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug    -70> 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: _renew_subs
> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug    -69> 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: _send_mon_message to mon.strg-node3 at v2:10.0.103.3:3300/
> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug    -68> 2025-04-16T14:43:59.254+0000 7f74b5030700  4 mds.0.purge_queue operator(): data pool 3 not found in OSDMap
> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug    -67> 2025-04-16T14:43:59.254+0000 7f74b5030700  5 asok(0x560a2c44e000) register_command objecter_requests hook 0x560a2c3544c0
> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug    -66> 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: _renew_subs
> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug    -65> 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: _send_mon_message to mon.strg-node3 at v2:10.0.103.3:3300/
> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug    -64> 2025-04-16T14:43:59.254+0000 7f74b5030700 10 log_channel(cluster) update_config to_monitors: true to_syslog: false syslog_facility: daemon prio: info to_graylog: false graylog_host: 127.0.0.1 graylog_port: 12201)
> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug    -63> 2025-04-16T14:43:59.254+0000 7f74b5030700  4 mds.0.purge_queue operator(): data pool 3 not found in OSDMap
> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug    -62> 2025-04-16T14:43:59.254+0000 7f74b5030700  4 mds.0.0 handle_osd_map epoch 0, 0 new blocklist entries
> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug    -61> 2025-04-16T14:43:59.254+0000 7f74b5030700  1 mds.0.127518 handle_mds_map i am now mds.0.127518
> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug    -60> 2025-04-16T14:43:59.254+0000 7f74b5030700  1 mds.0.127518 handle_mds_map state change up:boot --> up:replay
> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug    -59> 2025-04-16T14:43:59.254+0000 7f74b5030700  5 mds.beacon.mummasstrg.strg-node3.xhxbwx set_want_state: up:boot -> up:replay
> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug    -58> 2025-04-16T14:43:59.254+0000 7f74b5030700  1 mds.0.127518 replay_start
> >>> >> >> > *Apr 16 20:13:59 strg-node3 bash[59114]: debug    -57> 2025-04-16T14:43:59.254+0000 7f74b5030700  1 mds.0.127518 waiting for osdmap 45749 (which blocklists prior instance)*
> >>> >> >> > Apr 16 20:13:59 strg-node3 bash[59114]: debug    -56> 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: _send_mon_message to mon.strg-node3 at v2:10.0.103.3:3300/0
> >>> >> >> > *Apr 16 20:13:59 strg-node3 bash[59114]: debug    -55> 2025-04-16T14:43:59.254+0000 7f74b5030700  4 mds.0.purge_queue operator(): data pool 3 not found in OSDMap*
> >>> >> >> >
> >>> >> >> > On Thu, Apr 17, 2025 at 7:06 AM Amudhan P <amudha...@gmail.com> wrote:
> >>> >> >> >
> >>> >> >> >> Eugen,
> >>> >> >> >>
> >>> >> >> >> This is the output of the commands:
> >>> >> >> >>
> >>> >> >> >> cephfs-journal-tool --rank=mumstrg:all --journal=purge_queue journal inspect
> >>> >> >> >> Overall journal integrity: OK
> >>> >> >> >>
> >>> >> >> >> cephfs-journal-tool --rank=mumstrg:all --journal=mdlog journal inspect
> >>> >> >> >> Overall journal integrity: OK
> >>> >> >> >>
> >>> >> >> >> On Thu, Apr 17, 2025 at 2:59 AM Eugen Block <ebl...@nde.ag> wrote:
> >>> >> >> >>
> >>> >> >> >>> I think either your mdlog or the purge_queue journal is corrupted:
> >>> >> >> >>>
> >>> >> >> >>> 2025-04-16T09:59:30.146+0000 7f43cf872700  2 mds.0.127506 Booting: 2: waiting for purge queue recovered
> >>> >> >> >>> Apr 16 15:29:30 strg-node4 bash[7566]: debug     -1> 2025-04-16T09:59:30.146+0000 7f43d9085700 10 monclient: get_auth_request con 0x562856a25400 auth_method 0
> >>> >> >> >>> Apr 16 15:29:30 strg-node4 bash[7566]: debug      0> 2025-04-16T09:59:30.230+0000 7f43ce06f700 -1 *** Caught signal (Segmentation fault) **
> >>> >> >> >>> Apr 16 15:29:30 strg-node4 bash[7566]: in thread 7f43ce06f700 thread_name:md_log_replay
> >>> >> >> >>>
> >>> >> >> >>> Can you paste the output of these commands?
> >>> >> >> >>>
> >>> >> >> >>> cephfs-journal-tool --rank={YOUR_CEPH_FS}:all --journal=purge_queue journal inspect
> >>> >> >> >>> cephfs-journal-tool --rank={YOUR_CEPH_FS}:all --journal=mdlog journal inspect
> >>> >> >> >>>
> >>> >> >> >>> I expect one or more damaged entries.
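> >>> >> >> >>>
> >>> >> >> >>> For comparison, on a damaged journal the inspect output
> >>> >> >> >>> looks roughly like this (quoting from memory, the exact
> >>> >> >> >>> wording may differ; the offsets are placeholders):
> >>> >> >> >>>
> >>> >> >> >>> Overall journal integrity: DAMAGED
> >>> >> >> >>> Corrupt regions:
> >>> >> >> >>>   0x<start>-0x<end>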
> >>> >> >> >>> > >>> >> >> >>> cephfs-journal-tool --rank={YOUR_CEPH_FS}:all > >>> --journal=purge_queue > >>> >> >> >>> journal inspect > >>> >> >> >>> cephfs-journal-tool --rank={YOUR_CEPH_FS}:all --journal=mdlog > >>> >> journal > >>> >> >> >>> inspect > >>> >> >> >>> > >>> >> >> >>> I expect one or more damaged entries. Check this thread for > more > >>> >> >> details: > >>> >> >> >>> > >>> >> >> >>> https://www.spinics.net/lists/ceph-users/msg80124.html > >>> >> >> >>> > >>> >> >> >>> You should try to backup the journal, but in my case that > wasn't > >>> >> >> >>> possible, so I had no other choice than resetting it. > >>> >> >> >>> > >>> >> >> >>> Zitat von Amudhan P <amudha...@gmail.com>: > >>> >> >> >>> > >>> >> >> >>> > Hi, > >>> >> >> >>> > > >>> >> >> >>> > I am having 2 problems with my Ceph version 16.2.6 > >>> >> >> >>> > (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable) > >>> >> deployed > >>> >> >> >>> thru > >>> >> >> >>> > cephadm. > >>> >> >> >>> > > >>> >> >> >>> > First issue :- > >>> >> >> >>> > 1 out 3 mon service went out of quorum . > >>> >> >> >>> > When restarted service it comes normal but after a few > >>> minutes in > >>> >> >> ceph > >>> >> >> >>> > watch log it reports slow ops and mon goes out of quorum. > >>> >> >> >>> > Node where this mon service failed had one weird thing > that I > >>> >> could > >>> >> >> see > >>> >> >> >>> 40% > >>> >> >> >>> > of wait in the top command. But I don't see any error in > >>> dmesg or > >>> >> >> >>> anything > >>> >> >> >>> > related to drive IO error. > >>> >> >> >>> > Below are the logs that were printed in ceph watch command. > >>> >> >> >>> > > >>> >> >> >>> > 2025-04-16T09:30:00.000393+0530 mon.strg-node2 [WRN] [WRN] > >>> >> MON_DOWN: > >>> >> >> 1/3 > >>> >> >> >>> > mons down, quorum strg-node2,strg-node3 > >>> >> >> >>> > 2025-04-16T09:30:00.000416+0530 mon.strg-node2 [WRN] > >>> >> >> mon.strg-node1 > >>> >> >> >>> > (rank 0) addr [v2:10.0.103.1:3300/0,v1:10.0.103.1:6789/0] > is > >>> down > >>> >> >> (out > >>> >> >> >>> of > >>> >> >> >>> > quorum) > >>> >> >> >>> > > >>> >> >> >>> > For now this is not appearing again. > >>> >> >> >>> > > >>> >> >> >>> > > >>> >> >> >>> > Second issue Cephfs degraded :- > >>> >> >> >>> > I have 2 MDS services running in 2 different nodes. Both > are > >>> in a > >>> >> >> >>> stopped > >>> >> >> >>> > state. > >>> >> >> >>> > when running Ceph -s command > >>> >> >> >>> > > >>> >> >> >>> > cluster: > >>> >> >> >>> > id: 7b3a4952-2131-11ec-94ce-0cc47a5ec98a > >>> >> >> >>> > health: HEALTH_WARN > >>> >> >> >>> > 2 failed cephadm daemon(s) > >>> >> >> >>> > 1 filesystem is degraded > >>> >> >> >>> > insufficient standby MDS daemons available > >>> >> >> >>> > > >>> >> >> >>> > services: > >>> >> >> >>> > mon: 3 daemons, quorum strg-node1,strg-node2,strg-node3 > >>> (age > >>> >> 4h) > >>> >> >> >>> > mgr: strg-node2.unyimy(active, since 4h), standbys: > >>> >> >> >>> strg-node1.ivkfid > >>> >> >> >>> > mds: 1/1 daemons up > >>> >> >> >>> > osd: 32 osds: 32 up (since 4h), 32 in (since 10w) > >>> >> >> >>> > > >>> >> >> >>> > data: > >>> >> >> >>> > volumes: 0/1 healthy, 1 recovering > >>> >> >> >>> > pools: 3 pools, 321 pgs > >>> >> >> >>> > objects: 15.49M objects, 54 TiB > >>> >> >> >>> > usage: 109 TiB used, 66 TiB / 175 TiB avail > >>> >> >> >>> > pgs: 321 active+clean > >>> >> >> >>> > > >>> >> >> >>> > Volume shows recovering but there wasn't any progress till > now > >>> >> even > >>> >> >> >>> manual > >>> >> >> >>> > start mds service fails again. 
> >>> >> >> >>>
> >>> >> >> >>> Zitat von Amudhan P <amudha...@gmail.com>:
> >>> >> >> >>>
> >>> >> >> >>> > Hi,
> >>> >> >> >>> >
> >>> >> >> >>> > I am having 2 problems with my Ceph version 16.2.6
> >>> >> >> >>> > (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable),
> >>> >> >> >>> > deployed through cephadm.
> >>> >> >> >>> >
> >>> >> >> >>> > First issue:
> >>> >> >> >>> > 1 out of 3 mon services went out of quorum.
> >>> >> >> >>> > When the service is restarted it comes back to normal, but
> >>> >> >> >>> > after a few minutes the ceph watch log reports slow ops and
> >>> >> >> >>> > the mon goes out of quorum again.
> >>> >> >> >>> > The node where this mon service failed had one weird thing:
> >>> >> >> >>> > I could see 40% wait in the top command. But I don't see any
> >>> >> >> >>> > error in dmesg or anything related to drive I/O errors.
> >>> >> >> >>> > Below are the logs that were printed by the ceph watch command:
> >>> >> >> >>> >
> >>> >> >> >>> > 2025-04-16T09:30:00.000393+0530 mon.strg-node2 [WRN] [WRN] MON_DOWN: 1/3 mons down, quorum strg-node2,strg-node3
> >>> >> >> >>> > 2025-04-16T09:30:00.000416+0530 mon.strg-node2 [WRN]     mon.strg-node1 (rank 0) addr [v2:10.0.103.1:3300/0,v1:10.0.103.1:6789/0] is down (out of quorum)
> >>> >> >> >>> >
> >>> >> >> >>> > For now this is not appearing again.
> >>> >> >> >>> >
> >>> >> >> >>> > Second issue, CephFS degraded:
> >>> >> >> >>> > I have 2 MDS services running on 2 different nodes. Both are
> >>> >> >> >>> > in a stopped state.
> >>> >> >> >>> > When running the ceph -s command:
> >>> >> >> >>> >
> >>> >> >> >>> >   cluster:
> >>> >> >> >>> >     id:     7b3a4952-2131-11ec-94ce-0cc47a5ec98a
> >>> >> >> >>> >     health: HEALTH_WARN
> >>> >> >> >>> >             2 failed cephadm daemon(s)
> >>> >> >> >>> >             1 filesystem is degraded
> >>> >> >> >>> >             insufficient standby MDS daemons available
> >>> >> >> >>> >
> >>> >> >> >>> >   services:
> >>> >> >> >>> >     mon: 3 daemons, quorum strg-node1,strg-node2,strg-node3 (age 4h)
> >>> >> >> >>> >     mgr: strg-node2.unyimy(active, since 4h), standbys: strg-node1.ivkfid
> >>> >> >> >>> >     mds: 1/1 daemons up
> >>> >> >> >>> >     osd: 32 osds: 32 up (since 4h), 32 in (since 10w)
> >>> >> >> >>> >
> >>> >> >> >>> >   data:
> >>> >> >> >>> >     volumes: 0/1 healthy, 1 recovering
> >>> >> >> >>> >     pools:   3 pools, 321 pgs
> >>> >> >> >>> >     objects: 15.49M objects, 54 TiB
> >>> >> >> >>> >     usage:   109 TiB used, 66 TiB / 175 TiB avail
> >>> >> >> >>> >     pgs:     321 active+clean
> >>> >> >> >>> >
> >>> >> >> >>> > The volume shows "recovering" but there hasn't been any
> >>> >> >> >>> > progress so far, and even a manual start of the MDS service
> >>> >> >> >>> > fails again. In the ceph -s output under services it shows
> >>> >> >> >>> > one MDS up, but no MDS service is actually running.
> >>> >> >> >>> >
> >>> >> >> >>> > Below is a log snip from one of the MDS services:
> >>> >> >> >>> >
> >>> >> >> >>> > -25> 2025-04-16T09:59:29.954+0000 7f43d0874700  1 mds.0.journaler.pq(ro) _finish_read_head loghead(trim 13967032320, ex>
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -24> 2025-04-16T09:59:29.954+0000 7f43d0874700  1 mds.0.journaler.pq(ro) probing for end of the log
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -23> 2025-04-16T09:59:29.954+0000 7f43d9085700 10 monclient: get_auth_request con 0x562856a17400 auth_method 0
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -22> 2025-04-16T09:59:29.954+0000 7f43d8884700 10 monclient: get_auth_request con 0x562856a17c00 auth_method 0
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -21> 2025-04-16T09:59:29.974+0000 7f43cf071700  1 mds.0.journaler.mdlog(ro) recover start
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -20> 2025-04-16T09:59:29.974+0000 7f43cf071700  1 mds.0.journaler.mdlog(ro) read_head
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -19> 2025-04-16T09:59:29.974+0000 7f43cf071700  4 mds.0.log Waiting for journal 0x200 to recover...
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -18> 2025-04-16T09:59:29.974+0000 7f43d8083700 10 monclient: get_auth_request con 0x562856a25000 auth_method 0
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -17> 2025-04-16T09:59:29.998+0000 7f43d0874700  1 mds.0.journaler.pq(ro) _finish_probe_end write_pos = 13968309289 (hea>
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -16> 2025-04-16T09:59:29.998+0000 7f43d0874700  4 mds.0.purge_queue operator(): open complete
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -15> 2025-04-16T09:59:29.998+0000 7f43d0874700  1 mds.0.journaler.pq(ro) set_writeable
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -14> 2025-04-16T09:59:29.998+0000 7f43cf872700  1 mds.0.journaler.mdlog(ro) _finish_read_head loghead(trim 189741504921>
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -13> 2025-04-16T09:59:29.998+0000 7f43cf872700  1 mds.0.journaler.mdlog(ro) probing for end of the log
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -12> 2025-04-16T09:59:30.002+0000 7f43d9085700 10 monclient: get_auth_request con 0x562856a25c00 auth_method 0
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -11> 2025-04-16T09:59:30.098+0000 7f43cf872700  1 mds.0.journaler.mdlog(ro) _finish_probe_end write_pos = 1897428915052>
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug    -10> 2025-04-16T09:59:30.098+0000 7f43cf071700  4 mds.0.log Journal 0x200 recovered.
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug     -9> 2025-04-16T09:59:30.098+0000 7f43cf071700  4 mds.0.log Recovered journal 0x200 in format 1
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug     -8> 2025-04-16T09:59:30.098+0000 7f43cf071700  2 mds.0.127506 Booting: 1: loading/discovering base inodes
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug     -7> 2025-04-16T09:59:30.098+0000 7f43cf071700  0 mds.0.cache creating system inode with ino:0x100
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug     -6> 2025-04-16T09:59:30.098+0000 7f43cf071700  0 mds.0.cache creating system inode with ino:0x1
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug     -5> 2025-04-16T09:59:30.098+0000 7f43d8884700 10 monclient: get_auth_request con 0x562856a25800 auth_method 0
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug     -4> 2025-04-16T09:59:30.098+0000 7f43d8083700 10 monclient: get_auth_request con 0x562856a5dc00 auth_method 0
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug     -3> 2025-04-16T09:59:30.146+0000 7f43cf872700  2 mds.0.127506 Booting: 2: replaying mds log
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug     -2> 2025-04-16T09:59:30.146+0000 7f43cf872700  2 mds.0.127506 Booting: 2: waiting for purge queue recovered
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug     -1> 2025-04-16T09:59:30.146+0000 7f43d9085700 10 monclient: get_auth_request con 0x562856a25400 auth_method 0
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug      0> 2025-04-16T09:59:30.230+0000 7f43ce06f700 -1 *** Caught signal (Segmentation fault) **
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: in thread 7f43ce06f700 thread_name:md_log_replay
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: 1: /lib64/libpthread.so.0(+0x12b20) [0x7f43dd293b20]
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: 2: /usr/lib64/ceph/libceph-common.so.2(+0x8ec7a00) [0x7f43e6ef3a00]
> >>> >> >> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> >>> >> >> >>> >
> >>> >> >> >>> > Not sure what caused the issue; I couldn't find any
> >>> >> >> >>> > resources to fix it.
> >>> >> >> >>> > Need help from someone to bring the ceph cluster online.
> >>> >> >> >>> > _______________________________________________ > >>> >> >> >>> > ceph-users mailing list -- ceph-users@ceph.io > >>> >> >> >>> > To unsubscribe send an email to ceph-users-le...@ceph.io > >>> >> >> >>> > >>> >> >> >>> > >>> >> >> >>> _______________________________________________ > >>> >> >> >>> ceph-users mailing list -- ceph-users@ceph.io > >>> >> >> >>> To unsubscribe send an email to ceph-users-le...@ceph.io > >>> >> >> >>> > >>> >> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > >>> >> > >>> >> > >>> >> > >>> >> > >>> > >>> > >>> > >>> > > > > _______________________________________________ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io