What is the state of your PGs? Could you post the output of "ceph -s"? I believe (though this is partly an assumption, based on having run into something similar myself) that under the hood cephadm uses the "ceph osd safe-to-destroy osd.X" command, and when osd.X is no longer running and not all PGs are active+clean (for instance, some are active+remapped), safe-to-destroy returns a negative answer along the lines of "OSD.X is not reporting stats, not all PGs are active+clean, cannot draw any conclusions". The cephadm OSD removal will then stall in that state until all PGs reach active+clean.
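If you want to confirm that, something along these lines should show it (osd.35 taken from your message; the exact wording of the output varies by release):

  ceph -s                          # overall health and PG state summary
  ceph pg ls remapped              # any PGs that are not yet back to active+clean
  ceph osd safe-to-destroy osd.35  # the check I suspect cephadm is gating on
  ceph orch osd rm status          # where the removal sits in cephadm's queue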
Respectfully,

*Wes Dillingham*
LinkedIn <http://www.linkedin.com/in/wesleydillingham>
w...@wesdillingham.com


On Tue, May 28, 2024 at 11:43 AM Matthew Vernon <mver...@wikimedia.org> wrote:

> Hi,
>
> I want to prepare a failed disk for replacement. I did:
>
>   ceph orch osd rm 35 --zap --replace
>
> and it's now in the state "Done, waiting for purge", with 0 PGs, and
> REPLACE and ZAP set to true. It's been like this for some hours, and now
> my cluster is unhappy:
>
>   [WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm
>       stray daemon osd.35 on host moss-be1002 not managed by cephadm
>
> (the OSD is down & out)
>
> ...and also neither the disk nor the relevant NVMe LV has been zapped.
>
> I have my OSDs deployed via a spec:
>
>   service_type: osd
>   service_id: rrd_single_NVMe
>   placement:
>     label: "NVMe"
>   spec:
>     data_devices:
>       rotational: 1
>     db_devices:
>       model: "NVMe"
>
> Before issuing the ceph orch osd rm I set that spec to unmanaged (ceph
> orch set-unmanaged osd.rrd_single_NVMe), as obviously I don't want ceph
> to just try and re-make a new OSD on the sad disk.
>
> I'd expected from the docs[0] that what I did would leave me with a
> system ready for the failed disk to be swapped (and that I could then
> mark osd.rrd_single_NVMe as managed again, and a new OSD built),
> including removing/wiping the NVMe LV so it can be removed.
>
> What did I do wrong? I don't much care about the OSD id (though obviously
> it's neater not to just increment OSD numbers every time a disk dies),
> but I thought that telling ceph orch not to make new OSDs and then using
> ceph orch osd rm to zap the disk and NVMe LV would have been the way to
> go...
>
> Thanks,
>
> Matthew
>
> [0] https://docs.ceph.com/en/reef/cephadm/services/osd/#replacing-an-osd
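For what it's worth, once the PGs are back to active+clean the removal you queued should finish on its own. If the zap still doesn't happen, the rough sequence I'd expect is below (a sketch only: /dev/sdX is a placeholder for the actual failed device, and check that your release has set-managed):

  ceph orch set-unmanaged osd.rrd_single_NVMe        # stop cephadm redeploying onto the bad disk
  ceph orch osd rm 35 --zap --replace                # what you already ran; keeps the OSD id reserved
  ceph orch osd rm status                            # watch the queued removal progress
  ceph orch device zap moss-be1002 /dev/sdX --force  # fallback: zap the device/LV by hand if needed
  ceph orch set-managed osd.rrd_single_NVMe          # after the disk swap, let the spec rebuild the OSD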