ceph crash info 2025-04-17T08:50:35.931485Z_0109f173-206c-471e-afac-c0d3e0aa2785
{
"backtrace": [
"/lib64/libpthread.so.0(+0x12b20) [0x7fbd42bbdb20]",
"/usr/lib64/ceph/libceph-common.so.2(+0x8ec7a00) [0x7fbd4c81da00]"
],
"ceph_version": "16.2.6",
"crash_id":
"2025-04-17T08:50:35.931485Z_0109f173-206c-471e-afac-c0d3e0aa2785",
"entity_name": "mds.mummasstrg.strg-node1.gchapr",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-mds",
"stack_sig":
"5238fe1be0b82b479b10ee7d17b5ad3182cdc93ede581af63d627a472a4fcf9e",
"timestamp": "2025-04-17T08:50:35.931485Z",
"utsname_hostname": "strg-node1",
"utsname_machine": "x86_64",
"utsname_release": "5.10.0-8-amd64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Debian 5.10.46-5 (2021-09-23)"
}
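For anyone comparing several of these crash reports, here is a minimal sketch that reduces the JSON above to the few fields usually quoted on the list. The function name `summarize_crash` and the trimmed `sample` string are illustrative assumptions, not part of any Ceph tooling; it only reads keys that appear in the report above.

```python
import json

def summarize_crash(raw: str) -> dict:
    """Reduce `ceph crash info` JSON to a short summary (hypothetical helper)."""
    info = json.loads(raw)
    frames = info.get("backtrace", [])
    return {
        "entity": info.get("entity_name"),
        "version": info.get("ceph_version"),
        "timestamp": info.get("timestamp"),
        "frames": len(frames),
        # Frames look like "/path/lib.so(+0xOFFSET) [0xADDR]"; keep only the path.
        "libraries": [frame.split("(", 1)[0] for frame in frames],
    }

# Trimmed copy of the crash report above (only the fields used here).
sample = """
{
  "backtrace": [
    "/lib64/libpthread.so.0(+0x12b20) [0x7fbd42bbdb20]",
    "/usr/lib64/ceph/libceph-common.so.2(+0x8ec7a00) [0x7fbd4c81da00]"
  ],
  "ceph_version": "16.2.6",
  "entity_name": "mds.mummasstrg.strg-node1.gchapr",
  "process_name": "ceph-mds",
  "timestamp": "2025-04-17T08:50:35.931485Z"
}
"""

summary = summarize_crash(sample)
for key, value in summary.items():
    print(f"{key}: {value}")
```

With only two unsymbolized frames, the summary mostly confirms which daemon and version crashed; it does not say anything about the root cause.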
Oh right, I just noticed you had the status in your first message. Is
there any chance the MDS is out of memory? Can you paste a complete
(fresh) startup log to a pastebin or similar? The complete output of
'ceph crash info <crash>' could also be useful.
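When skimming a long startup log before pasting it, a small sketch like the following can pull out the signal and the crashing thread. The helper name `find_crash_signature` is an assumption for illustration; the excerpt is copied from the MDS log quoted further down, and the whitespace is flattened first because the mail client wrapped the lines.

```python
import re

def find_crash_signature(log_text: str):
    """Return the signal and thread name of the first crash in a log, or None."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    signal = re.search(r"Caught signal \(([^)]+)\)", flat)
    if signal is None:
        return None
    # Only look for the thread name after the signal line.
    thread = re.search(r"thread_name:(\S+)", flat[signal.end():])
    return {
        "signal": signal.group(1),
        "thread": thread.group(1) if thread else None,
    }

excerpt = """
Apr 16 15:29:30 strg-node4 bash[7566]: debug 0>
2025-04-16T09:59:30.230+0000 7f43ce06f700 -1 *** Caught signal
(Segmentation fault) **
Apr 16 15:29:30 strg-node4 bash[7566]: in thread 7f43ce06f700
thread_name:md_log_replay
"""

result = find_crash_signature(excerpt)
print(result)
```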
Quoting Amudhan P <amudha...@gmail.com>:
> I ran the status and stat commands; the output is below.
>
> ceph -s
>
> cluster:
> id: 7b3a4952-2131-11ec-94ce-0cc47a5ec98a
> health: HEALTH_WARN
> 2 failed cephadm daemon(s)
> 1 filesystem is degraded
> insufficient standby MDS daemons available
> 7 daemons have recently crashed
>
> services:
> mon: 3 daemons, quorum strg-node1,strg-node2,strg-node3 (age 20h)
> mgr: strg-node2.unyimy(active, since 20h), standbys: strg-node1.ivkfid
> mds: 1/1 daemons up
> osd: 32 osds: 32 up (since 20h), 32 in (since 10w)
>
> data:
> volumes: 0/1 healthy, 1 recovering
> pools: 3 pools, 321 pgs
> objects: 15.49M objects, 54 TiB
> usage: 109 TiB used, 66 TiB / 175 TiB avail
> pgs: 317 active+clean
> 4 active+clean+scrubbing+deep
>
>
> ceph mds stat
> mumstrg:1/1 {0=mumstrg.strg-node1.gchapr=up:replay(laggy or crashed)}
>
> ceph osd lspools
> 1 device_health_metrics
> 2 cephfs.mumstrg.meta
> 3 cephfs.mumstrg.data
>
>
>
> On Thu, Apr 17, 2025 at 10:33 AM Eugen Block <ebl...@nde.ag> wrote:
>
>> What’s your overall Ceph status? It says data pool 3 not found.
>>
>> Quoting Amudhan P <amudha...@gmail.com>:
>>
>> > There are a few more logs from the MDS. I have highlighted a few log lines
>> > that I am not sure about.
>> >
>> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -79>
>> > 2025-04-16T14:43:59.170+0000 7f74c6ecd780 5 asok(0x560a2c44e000)
>> > register_command dump inode hook 0x560a2c354580
>> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -78>
>> > 2025-04-16T14:43:59.170+0000 7f74c6ecd780 5 asok(0x560a2c44e000)
>> > register_command exit hook 0x560a2c354580
>> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -77>
>> > 2025-04-16T14:43:59.170+0000 7f74c6ecd780 5 asok(0x560a2c44e000)
>> > register_command respawn hook 0x560a2c354580
>> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -76>
>> > 2025-04-16T14:43:59.170+0000 7f74c6ecd780 5 asok(0x560a2c44e000)
>> > register_command heap hook 0x560a2c354580
>> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -75>
>> > 2025-04-16T14:43:59.170+0000 7f74b5030700 1 mds.mumstrg.strg-node3.xhxbwx
>> > Updating MDS map to version 127517 from mon.2
>> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -74>
>> > 2025-04-16T14:43:59.170+0000 7f74c6ecd780 5 asok(0x560a2c44e000)
>> > register_command cpu_profiler hook 0x560a2c35458
>> >
>> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -73>
>> > 2025-04-16T14:43:59.170+0000 7f74b302c700 5
>> > mds.beacon.mumstrg.strg-node3.xhxbwx Sending beacon up:boot seq 1
>> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -72>
>> > 2025-04-16T14:43:59.170+0000 7f74b302c700 10 monclient: _send_mon_message
>> > to mon.strg-node3 at v2:10.0.103.3:3300/
>> >
>> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -71>
>> > 2025-04-16T14:43:59.254+0000 7f74b5030700 1 mds.mumstrg.strg-node3.xhxbwx
>> > Updating MDS map to version 127518 from mon.2
>> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -70>
>> > 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: _renew_subs
>> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -69>
>> > 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: _send_mon_message
>> > to mon.strg-node3 at v2:10.0.103.3:3300/
>> >
>> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -68>
>> > 2025-04-16T14:43:59.254+0000 7f74b5030700 4 mds.0.purge_queue operator():
>> > data pool 3 not found in OSDMap
>> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -67>
>> > 2025-04-16T14:43:59.254+0000 7f74b5030700 5 asok(0x560a2c44e000)
>> > register_command objecter_requests hook 0x560a2c3544c0
>> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -66>
>> > 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: _renew_subs
>> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -65>
>> > 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: _send_mon_message
>> > to mon.strg-node3 at v2:10.0.103.3:3300/
>> >
>> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -64>
>> > 2025-04-16T14:43:59.254+0000 7f74b5030700 10 log_channel(cluster)
>> > update_config to_monitors: true to_syslog: false
>> > syslog_facility: daemon prio: info to_graylog: false graylog_host:
>> > 127.0.0.1 graylog_port: 12201)
>> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -63>
>> > 2025-04-16T14:43:59.254+0000 7f74b5030700 4 mds.0.purge_queue operator():
>> > data pool 3 not found in OSDMap
>> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -62>
>> > 2025-04-16T14:43:59.254+0000 7f74b5030700 4 mds.0.0 handle_osd_map epoch
>> > 0, 0 new blocklist entries
>> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -61>
>> > 2025-04-16T14:43:59.254+0000 7f74b5030700 1 mds.0.127518 handle_mds_map
>> > i am now mds.0.127518
>> >
>> >> Apr 16 20:13:59 strg-node3 bash[59114]: debug -60>
>> >> 2025-04-16T14:43:59.254+0000 7f74b5030700 1 mds.0.127518 handle_mds_map
>> >> state change up:boot --> up:replay
>> >> Apr 16 20:13:59 strg-node3 bash[59114]: debug -59>
>> >> 2025-04-16T14:43:59.254+0000 7f74b5030700 5
>> >> mds.beacon.mummasstrg.strg-node3.xhxbwx set_want_state: up:boot -> up:replay
>> >> Apr 16 20:13:59 strg-node3 bash[59114]: debug -58>
>> >> 2025-04-16T14:43:59.254+0000 7f74b5030700 1 mds.0.127518 replay_start
>> >> *Apr 16 20:13:59 strg-node3 bash[59114]: debug -57>
>> >> 2025-04-16T14:43:59.254+0000 7f74b5030700 1 mds.0.127518 waiting for
>> >> osdmap 45749 (which blocklists prior instance)*
>> >> Apr 16 20:13:59 strg-node3 bash[59114]: debug -56>
>> >> 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: _send_mon_message
>> >> to mon.strg-node3 at v2:10.0.103.3:3300/0
>> >> *Apr 16 20:13:59 strg-node3 bash[59114]: debug -55>
>> >> 2025-04-16T14:43:59.254+0000 7f74b5030700 4 mds.0.purge_queue operator():
>> >> data pool 3 not found in OSDMap*
>> >>
>> >
>> >
>> > On Thu, Apr 17, 2025 at 7:06 AM Amudhan P <amudha...@gmail.com> wrote:
>> >
>> >> Eugen,
>> >>
>> >> This is the output of the commands:
>> >> cephfs-journal-tool --rank=mumstrg:all --journal=purge_queue journal inspect
>> >> Overall journal integrity: OK
>> >> cephfs-journal-tool --rank=mumstrg:all --journal=mdlog journal inspect
>> >> Overall journal integrity: OK
>> >>
>> >> On Thu, Apr 17, 2025 at 2:59 AM Eugen Block <ebl...@nde.ag> wrote:
>> >>
>> >>> I think either your mdlog or the purge_queue journal is corrupted:
>> >>>
>> >>> 2025-04-16T09:59:30.146+0000 7f43cf872700 2 mds.0.127506 Booting: 2:
>> >>> waiting for purge queue recovered
>> >>> Apr 16 15:29:30 strg-node4 bash[7566]: debug -1>
>> >>> 2025-04-16T09:59:30.146+0000 7f43d9085700 10 monclient: get_auth_request
>> >>> con 0x562856a25400 auth_method 0
>> >>> Apr 16 15:29:30 strg-node4 bash[7566]: debug 0>
>> >>> 2025-04-16T09:59:30.230+0000 7f43ce06f700 -1 *** Caught signal
>> >>> (Segmentation fault) **
>> >>> Apr 16 15:29:30 strg-node4 bash[7566]: in thread 7f43ce06f700
>> >>> thread_name:md_log_replay
>> >>>
>> >>> Can you paste the output of this command?
>> >>>
>> >>> cephfs-journal-tool --rank={YOUR_CEPH_FS}:all --journal=purge_queue journal inspect
>> >>> cephfs-journal-tool --rank={YOUR_CEPH_FS}:all --journal=mdlog journal inspect
>> >>>
>> >>> I expect one or more damaged entries. Check this thread for more details:
>> >>>
>> >>> https://www.spinics.net/lists/ceph-users/msg80124.html
>> >>>
>> >>> You should try to back up the journal, but in my case that wasn't
>> >>> possible, so I had no other choice than resetting it.
>> >>>
>> >>> Quoting Amudhan P <amudha...@gmail.com>:
>> >>>
>> >>> > Hi,
>> >>> >
>> >>> > I am having 2 problems with my Ceph cluster, version 16.2.6
>> >>> > (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable), deployed
>> >>> > through cephadm.
>> >>> >
>> >>> > First issue:
>> >>> > 1 out of 3 mon services went out of quorum.
>> >>> > When the service is restarted it comes back to normal, but after a few
>> >>> > minutes the ceph watch log reports slow ops and the mon goes out of quorum.
>> >>> > One odd thing on the node where this mon service failed: the top command
>> >>> > showed 40% wait. But I don't see any error in dmesg or anything related
>> >>> > to a drive IO error.
>> >>> > Below are the logs that were printed by the ceph watch command.
>> >>> >
>> >>> > 2025-04-16T09:30:00.000393+0530 mon.strg-node2 [WRN] [WRN] MON_DOWN: 1/3
>> >>> > mons down, quorum strg-node2,strg-node3
>> >>> > 2025-04-16T09:30:00.000416+0530 mon.strg-node2 [WRN] mon.strg-node1
>> >>> > (rank 0) addr [v2:10.0.103.1:3300/0,v1:10.0.103.1:6789/0] is down (out of
>> >>> > quorum)
>> >>> >
>> >>> > For now this has not happened again.
>> >>> >
>> >>> >
>> >>> > Second issue, CephFS degraded:
>> >>> > I have 2 MDS services running on 2 different nodes. Both are in a
>> >>> > stopped state.
>> >>> > When running the ceph -s command:
>> >>> >
>> >>> > cluster:
>> >>> > id: 7b3a4952-2131-11ec-94ce-0cc47a5ec98a
>> >>> > health: HEALTH_WARN
>> >>> > 2 failed cephadm daemon(s)
>> >>> > 1 filesystem is degraded
>> >>> > insufficient standby MDS daemons available
>> >>> >
>> >>> > services:
>> >>> > mon: 3 daemons, quorum strg-node1,strg-node2,strg-node3 (age 4h)
>> >>> > mgr: strg-node2.unyimy(active, since 4h), standbys: strg-node1.ivkfid
>> >>> > mds: 1/1 daemons up
>> >>> > osd: 32 osds: 32 up (since 4h), 32 in (since 10w)
>> >>> >
>> >>> > data:
>> >>> > volumes: 0/1 healthy, 1 recovering
>> >>> > pools: 3 pools, 321 pgs
>> >>> > objects: 15.49M objects, 54 TiB
>> >>> > usage: 109 TiB used, 66 TiB / 175 TiB avail
>> >>> > pgs: 321 active+clean
>> >>> >
>> >>> > The volume shows recovering, but there hasn't been any progress so far;
>> >>> > even a manual start of the mds service fails again. The ceph -s command
>> >>> > shows mds up under services, but no mds service is actually running.
>> >>> >
>> >>> > Below is a log snippet from one of the mds services.
>> >>> >
>> >>> >
>> >>> > -25> 2025-04-16T09:59:29.954+0000 7f43d0874700 1
>> >>> > mds.0.journaler.pq(ro) _finish_read_head loghead(trim 13967032320, ex>
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -24>
>> >>> > 2025-04-16T09:59:29.954+0000 7f43d0874700 1 mds.0.journaler.pq(ro)
>> >>> > probing for end of the log
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -23>
>> >>> > 2025-04-16T09:59:29.954+0000 7f43d9085700 10 monclient: get_auth_request
>> >>> > con 0x562856a17400 auth_method 0
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -22>
>> >>> > 2025-04-16T09:59:29.954+0000 7f43d8884700 10 monclient: get_auth_request
>> >>> > con 0x562856a17c00 auth_method 0
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -21>
>> >>> > 2025-04-16T09:59:29.974+0000 7f43cf071700 1 mds.0.journaler.mdlog(ro)
>> >>> > recover start
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -20>
>> >>> > 2025-04-16T09:59:29.974+0000 7f43cf071700 1 mds.0.journaler.mdlog(ro)
>> >>> > read_head
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -19>
>> >>> > 2025-04-16T09:59:29.974+0000 7f43cf071700 4 mds.0.log Waiting for journal
>> >>> > 0x200 to recover...
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -18>
>> >>> > 2025-04-16T09:59:29.974+0000 7f43d8083700 10 monclient: get_auth_request
>> >>> > con 0x562856a25000 auth_method 0
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -17>
>> >>> > 2025-04-16T09:59:29.998+0000 7f43d0874700 1 mds.0.journaler.pq(ro)
>> >>> > _finish_probe_end write_pos = 13968309289 (hea>
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -16>
>> >>> > 2025-04-16T09:59:29.998+0000 7f43d0874700 4 mds.0.purge_queue operator():
>> >>> > open complete
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -15>
>> >>> > 2025-04-16T09:59:29.998+0000 7f43d0874700 1 mds.0.journaler.pq(ro)
>> >>> > set_writeable
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -14>
>> >>> > 2025-04-16T09:59:29.998+0000 7f43cf872700 1 mds.0.journaler.mdlog(ro)
>> >>> > _finish_read_head loghead(trim 189741504921>
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -13>
>> >>> > 2025-04-16T09:59:29.998+0000 7f43cf872700 1 mds.0.journaler.mdlog(ro)
>> >>> > probing for end of the log
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -12>
>> >>> > 2025-04-16T09:59:30.002+0000 7f43d9085700 10 monclient: get_auth_request
>> >>> > con 0x562856a25c00 auth_method 0
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -11>
>> >>> > 2025-04-16T09:59:30.098+0000 7f43cf872700 1 mds.0.journaler.mdlog(ro)
>> >>> > _finish_probe_end write_pos = 1897428915052>
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -10>
>> >>> > 2025-04-16T09:59:30.098+0000 7f43cf071700 4 mds.0.log Journal 0x200
>> >>> > recovered.
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -9>
>> >>> > 2025-04-16T09:59:30.098+0000 7f43cf071700 4 mds.0.log Recovered journal
>> >>> > 0x200 in format 1
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -8>
>> >>> > 2025-04-16T09:59:30.098+0000 7f43cf071700 2 mds.0.127506 Booting: 1:
>> >>> > loading/discovering base inodes
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -7>
>> >>> > 2025-04-16T09:59:30.098+0000 7f43cf071700 0 mds.0.cache creating system
>> >>> > inode with ino:0x100
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -6>
>> >>> > 2025-04-16T09:59:30.098+0000 7f43cf071700 0 mds.0.cache creating system
>> >>> > inode with ino:0x1
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -5>
>> >>> > 2025-04-16T09:59:30.098+0000 7f43d8884700 10 monclient: get_auth_request
>> >>> > con 0x562856a25800 auth_method 0
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -4>
>> >>> > 2025-04-16T09:59:30.098+0000 7f43d8083700 10 monclient: get_auth_request
>> >>> > con 0x562856a5dc00 auth_method 0
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -3>
>> >>> > 2025-04-16T09:59:30.146+0000 7f43cf872700 2 mds.0.127506 Booting: 2:
>> >>> > replaying mds log
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -2>
>> >>> > 2025-04-16T09:59:30.146+0000 7f43cf872700 2 mds.0.127506 Booting: 2:
>> >>> > waiting for purge queue recovered
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -1>
>> >>> > 2025-04-16T09:59:30.146+0000 7f43d9085700 10 monclient: get_auth_request
>> >>> > con 0x562856a25400 auth_method 0
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug 0>
>> >>> > 2025-04-16T09:59:30.230+0000 7f43ce06f700 -1 *** Caught signal
>> >>> > (Segmentation fault) **
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: in thread 7f43ce06f700
>> >>> > thread_name:md_log_replay
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: ceph version 16.2.6
>> >>> > (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: 1:
>> >>> > /lib64/libpthread.so.0(+0x12b20) [0x7f43dd293b20]
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: 2:
>> >>> > /usr/lib64/ceph/libceph-common.so.2(+0x8ec7a00) [0x7f43e6ef3a00]
>> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: NOTE: a copy of the executable, or
>> >>> > `objdump -rdS <executable>` is needed to interpret this.
>> >>> >
>> >>> >
>> >>> > I am not sure what caused the issue, and I couldn't find any resources
>> >>> > on how to fix it.
>> >>> > I need help from someone to bring the Ceph cluster back online.
>> >>> > _______________________________________________
>> >>> > ceph-users mailing list -- ceph-users@ceph.io
>> >>> > To unsubscribe send an email to ceph-users-le...@ceph.io