Hi Cephers,

I have recently completed a Reef to Squid upgrade in our test environment. 
After the upgrade, some of the OSDs did not come up: their podman containers 
are up and running, but the OSDs never joined the cluster and are marked down. 

After digging into the logs, I saw that the down OSDs try to boot with osdmap 
epoch 70297, while the current osdmap epoch is 72555. All the running OSDs log 
the latest osdmap, but every failing OSD is stuck on the same older osdmap 
(epoch 70297). 
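
(For reference, this is roughly how I compared the two epochs; osd.41 below is 
just one of the failing OSDs:)

# current cluster epoch, printed on the first line of the dump
ceph osd dump | head -1
# epoch the failing OSD keeps booting with, taken from its daemon log
cephadm logs --name osd.41 | grep -oE 'osd\.41 [0-9]+' | tail -1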

This is a single-node test cluster that has been running for years. All the 
OSDs are on the same host, and all of the OSD containers are running. 

After enabling debug logging on a failing OSD, I can see repeated ticks at 
that same old epoch, like the one below:

2025-10-23T21:00:23.945+0000 7f3e05b79640 20 osd.41 70297 tick 
last_purged_snaps_scrub 2025-10-22T22:20:42.570629+0000 next 
2025-10-24T03:13:12.570629+0000
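
(For completeness, I raised the debug level roughly like this and then 
restarted the OSD's container; <fsid> is the cluster fsid:)

# raise OSD debug logging for the failing daemon, then restart it
ceph config set osd.41 debug_osd 20
systemctl restart ceph-<fsid>@osd.41.service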

I exported the latest osdmap with "ceph osd getmap 72555 > /tmp/osdmap.72555" 
and, with the OSD container stopped, tried to import it with:

CEPH_ARGS="--bluestore-ignore-data-csum" ceph-objectstore-tool --data-path 
/var/lib/ceph/osd/ceph-41/ --op set-osdmap --file /tmp/osdmap.72555

but it fails with the error:
 
osdmap (#-1:9c8e9ef2:::osdmap.72555:0#) does not exist.
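
(In case the exported file itself is suspect, it can be sanity-checked with 
osdmaptool; the header should show epoch 72555:)

# decode the exported map and print its header lines, including the epoch
osdmaptool --print /tmp/osdmap.72555 | head -3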

Using ceph-bluestore-tool, I can verify that all the PGs and the underlying 
objects are still intact on disk. 
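
(Concretely, the checks were along these lines, run against the stopped OSD:)

# offline consistency check of the OSD's BlueStore
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-41
# list the PGs still present in the object store
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-41 --op list-pgs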

Does anyone have a clue why some of the OSDs do not get the latest osdmap from 
the mon? 

Thanks in advance,
Huseyin
[email protected]
