Hi,

We recently converted a legacy cluster running Quincy v17.2.7 to cephadm. The conversion went smoothly and left all osds unmanaged by the orchestrator, as expected. We're now in the process of converting the osds to be managed by the orchestrator. We successfully converted a few of them, but then the orchestrator somehow got confused. `ceph health detail` reports a "stray daemon" for the osd we're trying to convert, and the orchestrator is unable to refresh its device list, so it doesn't see any available devices.
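Concretely, these are the two refresh paths we've tried (details below):

```
# the usual recommended workaround: fail over the active mgr
ceph mgr fail

# ask the orchestrator to rescan devices on its hosts
ceph orch device ls --refresh
```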
From the perspective of the osd node, the osd has been wiped and is ready to be reinstalled. We've also rebooted the node for good measure. `ceph osd tree` shows that the osd has been destroyed, but the orchestrator won't reinstall it because it thinks the device is still active. The orchestrator's device information is stale, but we're unable to refresh it. The usual recommended workaround of failing over the mgr hasn't helped. We've also tried `ceph orch device ls --refresh` (shown above), to no avail. In fact, after running that command, subsequent runs of `ceph orch device ls` produce no output until the mgr is failed over again.

Is there a way to force the orchestrator to refresh its list of devices when it's in this state? If not, can anyone offer any suggestions on how to fix this problem?

Cheers,
/rjg

P.S. Some additional information in case it's helpful...

We're using the following command to replace existing devices so that they're managed by the orchestrator:

```
ceph orch osd rm <osd> --replace --zap
```

and we're currently stuck on osd 88.

```
ceph health detail
HEALTH_WARN 1 stray daemon(s) not managed by cephadm
[WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm
    stray daemon osd.88 on host ceph-osd31 not managed by cephadm
```

`ceph osd tree` shows that the osd has been destroyed and is ready to be replaced:

```
ceph osd tree-from ceph-osd31
ID   CLASS  WEIGHT    TYPE NAME        STATUS     REWEIGHT  PRI-AFF
-46         34.93088  host ceph-osd31
 84    ssd   3.49309      osd.84              up   1.00000  1.00000
 85    ssd   3.49309      osd.85              up   1.00000  1.00000
 86    ssd   3.49309      osd.86              up   1.00000  1.00000
 87    ssd   3.49309      osd.87              up   1.00000  1.00000
 88    ssd   3.49309      osd.88       destroyed         0  1.00000
 89    ssd   3.49309      osd.89              up   1.00000  1.00000
 90    ssd   3.49309      osd.90              up   1.00000  1.00000
 91    ssd   3.49309      osd.91              up   1.00000  1.00000
 92    ssd   3.49309      osd.92              up   1.00000  1.00000
 93    ssd   3.49309      osd.93              up   1.00000  1.00000
```

The cephadm log shows a claim on node `ceph-osd31` for that osd:

```
2024-09-25T14:15:45.699348-0400 mgr.ceph-mon3.qzjgws [INF] Found osd claims -> {'ceph-osd31': ['88']}
2024-09-25T14:15:45.699534-0400 mgr.ceph-mon3.qzjgws [INF] Found osd claims for drivegroup ceph-osd31 -> {'ceph-osd31': ['88']}
```
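(In case anyone wants to reproduce: something like this should pull the same claim messages from the cephadm channel of the cluster log; the count and level are arbitrary.)

```
# show recent cephadm-channel messages from the cluster log
ceph log last 100 info cephadm
```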
`ceph orch device ls` shows that the device list isn't refreshing:

```
ceph orch device ls ceph-osd31
HOST        PATH      TYPE  DEVICE ID                               SIZE   AVAILABLE  REFRESHED  REJECT REASONS
ceph-osd31  /dev/sdc  ssd   INTEL_SSDSC2KG038T8_PHYG039603PE3P8EGN  3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sdd  ssd   INTEL_SSDSC2KG038T8_PHYG039600AY3P8EGN  3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sde  ssd   INTEL_SSDSC2KG038T8_PHYG039600CW3P8EGN  3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sdf  ssd   INTEL_SSDSC2KG038T8_PHYG039600CM3P8EGN  3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sdg  ssd   INTEL_SSDSC2KG038T8_PHYG039600UB3P8EGN  3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sdh  ssd   INTEL_SSDSC2KG038T8_PHYG039603753P8EGN  3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sdi  ssd   INTEL_SSDSC2KG038T8_PHYG039603R63P8EGN  3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sdj  ssd   INTEL_SSDSC2KG038TZ_PHYJ4011032M3P8DGN  3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sdk  ssd   INTEL_SSDSC2KG038TZ_PHYJ3234010J3P8DGN  3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sdl  ssd   INTEL_SSDSC2KG038T8_PHYG039603NS3P8EGN  3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
```

`ceph node ls` thinks the osd still exists:

```
ceph node ls osd | jq -r '."ceph-osd31"'
[
  84,
  85,
  86,
  87,
  88,    <-- this shouldn't exist
  89,
  90,
  91,
  92,
  93
]
```

Each osd node has 10x 3.8 TB ssd drives for osds. On `ceph-osd31`, cephadm doesn't see osd.88, as expected:

```
cephadm ls --no-detail
[
  {
    "style": "cephadm:v1",
    "name": "osd.93",
    "fsid": "9b3b3539-59a9-4338-8bab-3badfab6e855",
    "systemd_unit": "ceph-9b3b3539-59a9-4338-8bab-3badfab6e855@osd.93"
  },
  {
    "style": "cephadm:v1",
    "name": "osd.85",
    "fsid": "9b3b3539-59a9-4338-8bab-3badfab6e855",
    "systemd_unit": "ceph-9b3b3539-59a9-4338-8bab-3badfab6e855@osd.85"
  },
  {
    "style": "cephadm:v1",
    "name": "osd.90",
    "fsid": "9b3b3539-59a9-4338-8bab-3badfab6e855",
    "systemd_unit": "ceph-9b3b3539-59a9-4338-8bab-3badfab6e855@osd.90"
  },
  {
    "style": "cephadm:v1",
    "name": "osd.92",
    "fsid": "9b3b3539-59a9-4338-8bab-3badfab6e855",
    "systemd_unit": "ceph-9b3b3539-59a9-4338-8bab-3badfab6e855@osd.92"
  },
  {
    "style": "cephadm:v1",
    "name": "osd.89",
    "fsid": "9b3b3539-59a9-4338-8bab-3badfab6e855",
    "systemd_unit": "ceph-9b3b3539-59a9-4338-8bab-3badfab6e855@osd.89"
  },
  {
    "style": "cephadm:v1",
    "name": "osd.87",
    "fsid": "9b3b3539-59a9-4338-8bab-3badfab6e855",
    "systemd_unit": "ceph-9b3b3539-59a9-4338-8bab-3badfab6e855@osd.87"
  },
  {
    "style": "cephadm:v1",
    "name": "osd.86",
    "fsid": "9b3b3539-59a9-4338-8bab-3badfab6e855",
    "systemd_unit": "ceph-9b3b3539-59a9-4338-8bab-3badfab6e855@osd.86"
  },
  {
    "style": "cephadm:v1",
    "name": "osd.84",
    "fsid": "9b3b3539-59a9-4338-8bab-3badfab6e855",
    "systemd_unit": "ceph-9b3b3539-59a9-4338-8bab-3badfab6e855@osd.84"
  },
  {
    "style": "cephadm:v1",
    "name": "osd.91",
    "fsid": "9b3b3539-59a9-4338-8bab-3badfab6e855",
    "systemd_unit": "ceph-9b3b3539-59a9-4338-8bab-3badfab6e855@osd.91"
  }
]
```
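If it helps with diagnosis, we're happy to collect more state; for instance (a sketch, output omitted):

```
# is osd.88 still stuck in the orchestrator's removal/replace queue?
ceph orch osd rm status

# ask ceph-volume directly what it thinks of the wiped disk
# (run on ceph-osd31; this bypasses the orchestrator's cached inventory)
cephadm ceph-volume -- inventory /dev/sdg
```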
`lsblk` shows that `/dev/sdg` has been wiped:

```
NAME                                                                                                  MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                                                                                                     8:0    0 223.6G  0 disk
|-sda1                                                                                                  8:1    0    94M  0 part
`-sda2                                                                                                  8:2    0 223.5G  0 part
  `-md0                                                                                                 9:0    0 223.4G  0 raid1 /
sdb                                                                                                     8:16   0 223.6G  0 disk
|-sdb1                                                                                                  8:17   0    94M  0 part
`-sdb2                                                                                                  8:18   0 223.5G  0 part
  `-md0                                                                                                 9:0    0 223.4G  0 raid1 /
sdc                                                                                                     8:32   1   3.5T  0 disk
`-ceph--03782b4c--9faa--49f5--b554--98e7b8515834-osd--block--ba272724--daa6--45f5--9f69--789cc0bda077 253:3    0   3.5T  0 lvm
  `-keCkP2-o6h8-jKkw-RKiD-UBFf-A8EL-JDJGPR                                                            253:9    0   3.5T  0 crypt
sdd                                                                                                     8:48   1   3.5T  0 disk
`-ceph--c07907d8--4a75--4ba3--b5e1--2ebf49ecbdf6-osd--block--58d1d50d--6228--4e6f--9a52--2a305ba00700 253:7    0   3.5T  0 lvm
  `-WB8Mxn-qCHI-4T01-imiG-hNBR-by60-YuxgfD                                                            253:11   0   3.5T  0 crypt
sde                                                                                                     8:64   1   3.5T  0 disk
`-ceph--6f9d4df4--7ce6--44a4--a7b1--62c85af8cfe0-osd--block--aabcb30d--0084--490a--969b--78f7af6e94da 253:8    0   3.5T  0 lvm
  `-g9qErH-vTXY-JQbs-eh61-W0Mn-TAV8-gof4zy                                                            253:12   0   3.5T  0 crypt
sdf                                                                                                     8:80   1   3.5T  0 disk
`-ceph--d6b728f8--e365--46db--b30f--6c00805c752b-osd--block--88426db7--2322--4807--ac2e--b49929e170d6 253:6    0   3.5T  0 lvm
  `-LNG2gB-pa0w-gl2v-hVQ3-6qTd-aXsR-Lenri3                                                            253:10   0   3.5T  0 crypt
sdg                                                                                                     8:96   1   3.5T  0 disk
sdh                                                                                                     8:112  1   3.5T  0 disk
`-ceph--de2cfee6--8e0a--4aa0--9e6b--90c09025768c-osd--block--a3b86251--2799--4243--a857--f218fa90f29a 253:2    0   3.5T  0 lvm
sdi                                                                                                     8:128  1   3.5T  0 disk
`-ceph--30dee450--0fdd--46ea--9eec--6a4c7706df9c-osd--block--bfc090db--dde4--47dd--a1c9--1cd838ea43b3 253:4    0   3.5T  0 lvm
sdj                                                                                                     8:144  1   3.5T  0 disk
`-ceph--78febcf5--43f4--4820--8dc7--0f6c22816c9f-osd--block--da1e69c7--6427--4562--8290--90bcb9526747 253:0    0   3.5T  0 lvm
sdk                                                                                                     8:160  1   3.5T  0 disk
`-ceph--fe210281--b1f5--4d5e--9ab0--2f226612af00-osd--block--6bb9f308--e853--4303--83ea--553c3a3513e1 253:1    0   3.5T  0 lvm
sdl                                                                                                     8:176  1   3.5T  0 disk
`-ceph--9f21c916--f211--4d1b--8214--6ad1cecac810-osd--block--572d850c--c201--4af4--ac42--0ed2a6ed73ed 253:5    0   3.5T  0 lvm
```
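P.P.S. Once the device cache does refresh, we expect the recorded claim to redeploy osd.88 onto /dev/sdg automatically. If it doesn't, the fallback we're considering is handing the device back to the orchestrator by hand; a sketch, assuming /dev/sdg is still the right path (our dmcrypt settings normally come from the drivegroup spec, not this command):

```
# fallback: explicitly re-add the wiped device as an osd
ceph orch daemon add osd ceph-osd31:/dev/sdg
```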