I have run the status and mds stat commands; the output is below.

ceph -s
  cluster:
    id:     7b3a4952-2131-11ec-94ce-0cc47a5ec98a
    health: HEALTH_WARN
            2 failed cephadm daemon(s)
            1 filesystem is degraded
            insufficient standby MDS daemons available
            7 daemons have recently crashed

  services:
    mon: 3 daemons, quorum strg-node1,strg-node2,strg-node3 (age 20h)
    mgr: strg-node2.unyimy(active, since 20h), standbys: strg-node1.ivkfid
    mds: 1/1 daemons up
    osd: 32 osds: 32 up (since 20h), 32 in (since 10w)

  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   3 pools, 321 pgs
    objects: 15.49M objects, 54 TiB
    usage:   109 TiB used, 66 TiB / 175 TiB avail
    pgs:     317 active+clean
             4   active+clean+scrubbing+deep

ceph mds stat
mumstrg:1/1 {0=mumstrg.strg-node1.gchapr=up:replay(laggy or crashed)}

ceph osd lspools
1 device_health_metrics
2 cephfs.mumstrg.meta
3 cephfs.mumstrg.data
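For reference, the "7 daemons have recently crashed" warning and the up:replay (laggy or crashed) state can be looked at more closely with the crash module and the filesystem status commands. A minimal set would be the following, where <crash-id> is a placeholder taken from the crash list:

ceph health detail
ceph fs status mumstrg
ceph crash ls
ceph crash info <crash-id>    # backtrace and metadata for one recorded crash

ceph crash info should show the backtrace recorded for the crashed MDS daemons, which can be compared against the segfault in the logs quoted below.
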
On Thu, Apr 17, 2025 at 10:33 AM Eugen Block <ebl...@nde.ag> wrote:

> What's your overall Ceph status? It says data pool 3 not found.
>
> Zitat von Amudhan P <amudha...@gmail.com>:
>
> > There are a few more logs from the MDS. I have highlighted a few log lines
> > that I am not sure about.
> >
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -79> 2025-04-16T14:43:59.170+0000 7f74c6ecd780 5 asok(0x560a2c44e000) register_command dump inode hook 0x560a2c354580
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -78> 2025-04-16T14:43:59.170+0000 7f74c6ecd780 5 asok(0x560a2c44e000) register_command exit hook 0x560a2c354580
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -77> 2025-04-16T14:43:59.170+0000 7f74c6ecd780 5 asok(0x560a2c44e000) register_command respawn hook 0x560a2c354580
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -76> 2025-04-16T14:43:59.170+0000 7f74c6ecd780 5 asok(0x560a2c44e000) register_command heap hook 0x560a2c354580
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -75> 2025-04-16T14:43:59.170+0000 7f74b5030700 1 mds.mumstrg.strg-node3.xhxbwx Updating MDS map to version 127517 from mon.2
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -74> 2025-04-16T14:43:59.170+0000 7f74c6ecd780 5 asok(0x560a2c44e000) register_command cpu_profiler hook 0x560a2c35458
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -73> 2025-04-16T14:43:59.170+0000 7f74b302c700 5 mds.beacon.mumstrg.strg-node3.xhxbwx Sending beacon up:boot seq 1
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -72> 2025-04-16T14:43:59.170+0000 7f74b302c700 10 monclient: _send_mon_message to mon.strg-node3 at v2:10.0.103.3:3300/
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -71> 2025-04-16T14:43:59.254+0000 7f74b5030700 1 mds.mumstrg.strg-node3.xhxbwx Updating MDS map to version 127518 from mon.2
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -70> 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: _renew_subs
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -69> 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: _send_mon_message to mon.strg-node3 at v2:10.0.103.3:3300/
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -68> 2025-04-16T14:43:59.254+0000 7f74b5030700 4 mds.0.purge_queue operator(): data pool 3 not found in OSDMap
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -67> 2025-04-16T14:43:59.254+0000 7f74b5030700 5 asok(0x560a2c44e000) register_command objecter_requests hook 0x560a2c3544c0
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -66> 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: _renew_subs
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -65> 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: _send_mon_message to mon.strg-node3 at v2:10.0.103.3:3300/
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -64> 2025-04-16T14:43:59.254+0000 7f74b5030700 10 log_channel(cluster) update_config to_monitors: true to_syslog: false syslog_facility: daemon prio: info to_graylog: false graylog_host: 127.0.0.1 graylog_port: 12201)
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -63> 2025-04-16T14:43:59.254+0000 7f74b5030700 4 mds.0.purge_queue operator(): data pool 3 not found in OSDMap
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -62> 2025-04-16T14:43:59.254+0000 7f74b5030700 4 mds.0.0 handle_osd_map epoch 0, 0 new blocklist entries
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -61> 2025-04-16T14:43:59.254+0000 7f74b5030700 1 mds.0.127518 handle_mds_map i am now mds.0.127518
> >
> >> Apr 16 20:13:59 strg-node3 bash[59114]: debug -60> 2025-04-16T14:43:59.254+0000 7f74b5030700 1 mds.0.127518 handle_mds_map state change up:boot --> up:replay
> >> Apr 16 20:13:59 strg-node3 bash[59114]: debug -59> 2025-04-16T14:43:59.254+0000 7f74b5030700 5 mds.beacon.mummasstrg.strg-node3.xhxbwx set_want_state: up:boot -> up:replay
> >> Apr 16 20:13:59 strg-node3 bash[59114]: debug -58> 2025-04-16T14:43:59.254+0000 7f74b5030700 1 mds.0.127518 replay_start
> >> *Apr 16 20:13:59 strg-node3 bash[59114]: debug -57> 2025-04-16T14:43:59.254+0000 7f74b5030700 1 mds.0.127518 waiting for osdmap 45749 (which blocklists prior instance)*
> >> Apr 16 20:13:59 strg-node3 bash[59114]: debug -56> 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: _send_mon_message to mon.strg-node3 at v2:10.0.103.3:3300/0
> >> *Apr 16 20:13:59 strg-node3 bash[59114]: debug -55> 2025-04-16T14:43:59.254+0000 7f74b5030700 4 mds.0.purge_queue operator(): data pool 3 not found in OSDMap*
> >
> > On Thu, Apr 17, 2025 at 7:06 AM Amudhan P <amudha...@gmail.com> wrote:
> >
> >> Eugen,
> >>
> >> This is the output of the commands:
> >> cephfs-journal-tool --rank=mumstrg:all --journal=purge_queue journal inspect
> >> Overall journal integrity: OK
> >> cephfs-journal-tool --rank=mumstrg:all --journal=mdlog journal inspect
> >> Overall journal integrity: OK
> >>
> >> On Thu, Apr 17, 2025 at 2:59 AM Eugen Block <ebl...@nde.ag> wrote:
> >>
> >>> I think either your mdlog or the purge_queue journal is corrupted:
> >>>
> >>> 2025-04-16T09:59:30.146+0000 7f43cf872700 2 mds.0.127506 Booting: 2: waiting for purge queue recovered
> >>> Apr 16 15:29:30 strg-node4 bash[7566]: debug -1> 2025-04-16T09:59:30.146+0000 7f43d9085700 10 monclient: get_auth_request con 0x562856a25400 auth_method 0
> >>> Apr 16 15:29:30 strg-node4 bash[7566]: debug 0> 2025-04-16T09:59:30.230+0000 7f43ce06f700 -1 *** Caught signal (Segmentation fault) **
> >>> Apr 16 15:29:30 strg-node4 bash[7566]: in thread 7f43ce06f700 thread_name:md_log_replay
> >>>
> >>> Can you paste the output of these commands?
> >>>
> >>> cephfs-journal-tool --rank={YOUR_CEPH_FS}:all --journal=purge_queue journal inspect
> >>> cephfs-journal-tool --rank={YOUR_CEPH_FS}:all --journal=mdlog journal inspect
> >>>
> >>> I expect one or more damaged entries.
> >>> Check this thread for more details:
> >>>
> >>> https://www.spinics.net/lists/ceph-users/msg80124.html
> >>>
> >>> You should try to back up the journal, but in my case that wasn't
> >>> possible, so I had no other choice than resetting it.
> >>>
> >>> Zitat von Amudhan P <amudha...@gmail.com>:
> >>>
> >>> > Hi,
> >>> >
> >>> > I am having 2 problems with my Ceph version 16.2.6
> >>> > (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable), deployed thru cephadm.
> >>> >
> >>> > First issue:
> >>> > 1 out of 3 mon services went out of quorum.
> >>> > When the service is restarted it comes back to normal, but after a few minutes the ceph watch log reports slow ops and the mon goes out of quorum again.
> >>> > The node where this mon service failed had one weird thing: I could see 40% wait in the top command. But I don't see any errors in dmesg or anything related to drive I/O errors.
> >>> > Below are the logs printed by the ceph watch command.
> >>> >
> >>> > 2025-04-16T09:30:00.000393+0530 mon.strg-node2 [WRN] [WRN] MON_DOWN: 1/3 mons down, quorum strg-node2,strg-node3
> >>> > 2025-04-16T09:30:00.000416+0530 mon.strg-node2 [WRN]     mon.strg-node1 (rank 0) addr [v2:10.0.103.1:3300/0,v1:10.0.103.1:6789/0] is down (out of quorum)
> >>> >
> >>> > For now this has not appeared again.
> >>> >
> >>> > Second issue, CephFS degraded:
> >>> > I have 2 MDS services running on 2 different nodes. Both are in a stopped state.
> >>> > When running the ceph -s command:
> >>> >
> >>> >   cluster:
> >>> >     id:     7b3a4952-2131-11ec-94ce-0cc47a5ec98a
> >>> >     health: HEALTH_WARN
> >>> >             2 failed cephadm daemon(s)
> >>> >             1 filesystem is degraded
> >>> >             insufficient standby MDS daemons available
> >>> >
> >>> >   services:
> >>> >     mon: 3 daemons, quorum strg-node1,strg-node2,strg-node3 (age 4h)
> >>> >     mgr: strg-node2.unyimy(active, since 4h), standbys: strg-node1.ivkfid
> >>> >     mds: 1/1 daemons up
> >>> >     osd: 32 osds: 32 up (since 4h), 32 in (since 10w)
> >>> >
> >>> >   data:
> >>> >     volumes: 0/1 healthy, 1 recovering
> >>> >     pools:   3 pools, 321 pgs
> >>> >     objects: 15.49M objects, 54 TiB
> >>> >     usage:   109 TiB used, 66 TiB / 175 TiB avail
> >>> >     pgs:     321 active+clean
> >>> >
> >>> > The volume shows recovering but there has been no progress so far, and manually starting the mds service fails again. In the ceph -s output under services it shows mds up, but no mds service is actually running.
> >>> >
> >>> > Below is a log snip from one of the mds services.
> >>> >
> >>> >  -25> 2025-04-16T09:59:29.954+0000 7f43d0874700 1 mds.0.journaler.pq(ro) _finish_read_head loghead(trim 13967032320, ex>
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -24> 2025-04-16T09:59:29.954+0000 7f43d0874700 1 mds.0.journaler.pq(ro) probing for end of the log
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -23> 2025-04-16T09:59:29.954+0000 7f43d9085700 10 monclient: get_auth_request con 0x562856a17400 auth_method 0
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -22> 2025-04-16T09:59:29.954+0000 7f43d8884700 10 monclient: get_auth_request con 0x562856a17c00 auth_method 0
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -21> 2025-04-16T09:59:29.974+0000 7f43cf071700 1 mds.0.journaler.mdlog(ro) recover start
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -20> 2025-04-16T09:59:29.974+0000 7f43cf071700 1 mds.0.journaler.mdlog(ro) read_head
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -19> 2025-04-16T09:59:29.974+0000 7f43cf071700 4 mds.0.log Waiting for journal 0x200 to recover...
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -18> 2025-04-16T09:59:29.974+0000 7f43d8083700 10 monclient: get_auth_request con 0x562856a25000 auth_method 0
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -17> 2025-04-16T09:59:29.998+0000 7f43d0874700 1 mds.0.journaler.pq(ro) _finish_probe_end write_pos = 13968309289 (hea>
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -16> 2025-04-16T09:59:29.998+0000 7f43d0874700 4 mds.0.purge_queue operator(): open complete
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -15> 2025-04-16T09:59:29.998+0000 7f43d0874700 1 mds.0.journaler.pq(ro) set_writeable
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -14> 2025-04-16T09:59:29.998+0000 7f43cf872700 1 mds.0.journaler.mdlog(ro) _finish_read_head loghead(trim 189741504921>
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -13> 2025-04-16T09:59:29.998+0000 7f43cf872700 1 mds.0.journaler.mdlog(ro) probing for end of the log
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -12> 2025-04-16T09:59:30.002+0000 7f43d9085700 10 monclient: get_auth_request con 0x562856a25c00 auth_method 0
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -11> 2025-04-16T09:59:30.098+0000 7f43cf872700 1 mds.0.journaler.mdlog(ro) _finish_probe_end write_pos = 1897428915052>
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -10> 2025-04-16T09:59:30.098+0000 7f43cf071700 4 mds.0.log Journal 0x200 recovered.
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -9> 2025-04-16T09:59:30.098+0000 7f43cf071700 4 mds.0.log Recovered journal 0x200 in format 1
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -8> 2025-04-16T09:59:30.098+0000 7f43cf071700 2 mds.0.127506 Booting: 1: loading/discovering base inodes
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -7> 2025-04-16T09:59:30.098+0000 7f43cf071700 0 mds.0.cache creating system inode with ino:0x100
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -6> 2025-04-16T09:59:30.098+0000 7f43cf071700 0 mds.0.cache creating system inode with ino:0x1
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -5> 2025-04-16T09:59:30.098+0000 7f43d8884700 10 monclient: get_auth_request con 0x562856a25800 auth_method 0
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -4> 2025-04-16T09:59:30.098+0000 7f43d8083700 10 monclient: get_auth_request con 0x562856a5dc00 auth_method 0
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -3> 2025-04-16T09:59:30.146+0000 7f43cf872700 2 mds.0.127506 Booting: 2: replaying mds log
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -2> 2025-04-16T09:59:30.146+0000 7f43cf872700 2 mds.0.127506 Booting: 2: waiting for purge queue recovered
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -1> 2025-04-16T09:59:30.146+0000 7f43d9085700 10 monclient: get_auth_request con 0x562856a25400 auth_method 0
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug 0> 2025-04-16T09:59:30.230+0000 7f43ce06f700 -1 *** Caught signal (Segmentation fault) **
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: in thread 7f43ce06f700 thread_name:md_log_replay
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: 1: /lib64/libpthread.so.0(+0x12b20) [0x7f43dd293b20]
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: 2: /usr/lib64/ceph/libceph-common.so.2(+0x8ec7a00) [0x7f43e6ef3a00]
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> >>> >
> >>> > I am not sure what caused the issue, and I couldn't find any resources to fix it.
> >>> > I need help from someone to bring the ceph cluster back online.
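
If the mdlog or purge_queue journal does turn out to be damaged, the backup and reset sequence Eugen refers to above would look roughly like the sketch below, following the disaster-recovery steps from the CephFS documentation. The mumstrg filesystem name and rank 0 come from this thread, the backup paths are only examples, and the recover/reset steps rewrite or discard metadata, so they should only be run after both journals have been exported:

# take the filesystem down so no MDS touches the journals
ceph fs fail mumstrg

# back up both journals first
cephfs-journal-tool --rank=mumstrg:0 journal export /root/mdlog-backup.bin
cephfs-journal-tool --rank=mumstrg:0 --journal=purge_queue journal export /root/purge_queue-backup.bin

# salvage whatever dentries can be recovered from the journal, then reset
cephfs-journal-tool --rank=mumstrg:0 event recover_dentries summary
cephfs-journal-tool --rank=mumstrg:0 journal reset
cephfs-journal-tool --rank=mumstrg:0 --journal=purge_queue journal reset
cephfs-table-tool mumstrg:all reset session

# let an MDS take over rank 0 again
ceph fs set mumstrg joinable true

Exporting first means the journals can still be inspected or re-imported later, and recover_dentries tries to write journalled metadata back into the backing pool before the journal is discarded; once an MDS is active again, a scrub of the filesystem is generally advisable to verify the metadata.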