Hi Cephers,

I recently completed a Reef to Squid upgrade in our test environment. After the upgrade, some of the OSDs did not come up: the podman containers are up and running, but some OSDs never joined the cluster and appear down.
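In case it helps, this is roughly how I am checking the state (just the standard CLI; the podman name filter is simply how I narrow things down on this host):

    ceph versions                  # mon/mgr report Squid after the upgrade
    ceph osd tree down             # the affected OSDs show up here as down
    podman ps --filter name=osd    # yet their containers are up and running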
After digging into the logs, I saw that the down OSDs try to boot with osdmap epoch 70297, while the current osdmap epoch is 72555. All of the running OSDs report the latest osdmap in their logs, but the failing ones are all stuck on the same older osdmap (epoch 70297). This is a single-node test cluster that has been running for years; all the OSDs are on the same host, and all of the OSD containers are running.

After enabling debug logging on a failing OSD, I can see several ticks with that epoch, like the one below:

    2025-10-23T21:00:23.945+0000 7f3e05b79640 20 osd.41 70297 tick last_purged_snaps_scrub 2025-10-22T22:20:42.570629+0000 next 2025-10-24T03:13:12.570629+000

I tried to fetch the latest osdmap with

    ceph osd getmap 72555 > /tmp/osdmap.72555

and to inject it into the failing OSD with

    CEPH_ARGS="--bluestore-ignore-data-csum" ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-41/ --op set-osdmap --file /tmp/osd_map_72555

but it fails with the error:

    osdmap (#-1:9c8e9ef2:::osdmap.72555:0#) does not exist.

Using ceph-bluestore-tool, I can verify that all the PGs and the underlying objects are still intact on disk.

Does anyone have a clue why some of the OSDs do not get the latest osdmap from the mon?
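For completeness, the verification I mentioned above was along these lines, run against one of the failing OSDs with its container stopped (osd.41 as an example; note that list-pgs is a ceph-objectstore-tool op, I used ceph-bluestore-tool only for the fsck):

    # read-only consistency check of the BlueStore data
    CEPH_ARGS="--bluestore-ignore-data-csum" ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-41
    # list the PGs still present in the objectstore
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-41 --op list-pgs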
Thanks in advance,
Huseyin
[email protected]

_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]