Hi,

We recently converted a legacy cluster running Quincy v17.2.7 to cephadm. The 
conversion went smoothly and left all osds unmanaged by the orchestrator as 
expected. We’re now in the process of converting the osds to be managed by the 
orchestrator. We successfully converted a few of them, but then the 
orchestrator somehow got confused: `ceph health detail` reports a “stray 
daemon” for the osd we’re trying to convert, and the orchestrator can no 
longer refresh its device list, so it doesn’t see any available devices.

From the perspective of the osd node, the osd has been wiped and is ready to be 
reinstalled. We’ve also rebooted the node for good measure. `ceph osd tree` 
shows that the osd has been destroyed, but the orchestrator won’t reinstall it 
because it thinks the device is still active. The orchestrator device 
information is stale, but we’re unable to refresh it. The usual recommended 
workaround of failing over the mgr hasn’t helped. We’ve also tried `ceph orch 
device ls --refresh` to no avail; in fact, after running that command, 
subsequent runs of `ceph orch device ls` produce no output at all until the 
mgr is failed over again.
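
For reference, this is roughly the loop we keep repeating (we fail the mgr 
over with `ceph mgr fail`; if you do it another way the effect should be the 
same):

```
ceph mgr fail                     # fail over the active mgr (device info is still stale afterwards)
ceph orch device ls --refresh     # ask the orchestrator to re-scan devices
ceph orch device ls ceph-osd31    # after the --refresh this produces no output at all...
ceph mgr fail                     # ...until the mgr is failed over yet again
```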

Is there a way to force the orchestrator to refresh its list of devices when in 
this state? If not, can anyone offer any suggestions on how to fix this problem?

Cheers,
/rjg

P.S. Some additional information in case it’s helpful...

We’re using the following command to replace existing devices so that they’re 
managed by the orchestrator:

```
ceph orch osd rm <osd> --replace --zap
```

and we’re currently stuck on osd 88.
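
In case the literal invocation matters, for the stuck osd that expands to:

```
# concrete form of the command above for the stuck osd
ceph orch osd rm 88 --replace --zap
# (the state of the removal/replace queue can be checked with `ceph orch osd rm status`)
```

The resulting health warning: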

```
ceph health detail
HEALTH_WARN 1 stray daemon(s) not managed by cephadm
[WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm
    stray daemon osd.88 on host ceph-osd31 not managed by cephadm
```

`ceph osd tree` shows that the osd has been destroyed and is ready to be 
replaced:

```
ceph osd tree-from ceph-osd31
ID   CLASS  WEIGHT    TYPE NAME        STATUS     REWEIGHT  PRI-AFF
-46         34.93088  host ceph-osd31
 84    ssd   3.49309      osd.84              up   1.00000  1.00000
 85    ssd   3.49309      osd.85              up   1.00000  1.00000
 86    ssd   3.49309      osd.86              up   1.00000  1.00000
 87    ssd   3.49309      osd.87              up   1.00000  1.00000
 88    ssd   3.49309      osd.88       destroyed         0  1.00000
 89    ssd   3.49309      osd.89              up   1.00000  1.00000
 90    ssd   3.49309      osd.90              up   1.00000  1.00000
 91    ssd   3.49309      osd.91              up   1.00000  1.00000
 92    ssd   3.49309      osd.92              up   1.00000  1.00000
 93    ssd   3.49309      osd.93              up   1.00000  1.00000
```

The cephadm log shows a claim on node `ceph-osd31` for that osd:

```
2024-09-25T14:15:45.699348-0400 mgr.ceph-mon3.qzjgws [INF] Found osd claims -> {'ceph-osd31': ['88']}
2024-09-25T14:15:45.699534-0400 mgr.ceph-mon3.qzjgws [INF] Found osd claims for drivegroup ceph-osd31 -> {'ceph-osd31': ['88']}
```
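
(For completeness, that snippet came from the cephadm cluster log channel, 
which we read with something like:)

```
# read the cephadm/orchestrator cluster log channel
ceph log last cephadm
```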

`ceph orch device ls` shows that the device list isn’t refreshing:

```
ceph orch device ls ceph-osd31
HOST        PATH      TYPE  DEVICE ID                                 SIZE   AVAILABLE  REFRESHED  REJECT REASONS
ceph-osd31  /dev/sdc  ssd   INTEL_SSDSC2KG038T8_PHYG039603PE3P8EGN    3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sdd  ssd   INTEL_SSDSC2KG038T8_PHYG039600AY3P8EGN    3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sde  ssd   INTEL_SSDSC2KG038T8_PHYG039600CW3P8EGN    3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sdf  ssd   INTEL_SSDSC2KG038T8_PHYG039600CM3P8EGN    3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sdg  ssd   INTEL_SSDSC2KG038T8_PHYG039600UB3P8EGN    3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sdh  ssd   INTEL_SSDSC2KG038T8_PHYG039603753P8EGN    3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sdi  ssd   INTEL_SSDSC2KG038T8_PHYG039603R63P8EGN    3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sdj  ssd   INTEL_SSDSC2KG038TZ_PHYJ4011032M3P8DGN    3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sdk  ssd   INTEL_SSDSC2KG038TZ_PHYJ3234010J3P8DGN    3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sdl  ssd   INTEL_SSDSC2KG038T8_PHYG039603NS3P8EGN    3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
```

`ceph node ls` still thinks the osd exists:

```
ceph node ls osd | jq -r '."ceph-osd31"'
[
  84,
  85,
  86,
  87,
  88,   <-- this shouldn't exist
  89,
  90,
  91,
  92,
  93
]
```

Each osd node has 10x 3.8 TB ssd drives for osds. On `ceph-osd31`, `cephadm ls` 
no longer lists osd.88, which is what we’d expect after the zap:

```
cephadm ls --no-detail
[
    {
        "style": "cephadm:v1",
        "name": "osd.93",
        "fsid": "9b3b3539-59a9-4338-8bab-3badfab6e855",
        "systemd_unit": "ceph-9b3b3539-59a9-4338-8bab-3badfab6e855@osd.93"
    },
    {
        "style": "cephadm:v1",
        "name": "osd.85",
        "fsid": "9b3b3539-59a9-4338-8bab-3badfab6e855",
        "systemd_unit": "ceph-9b3b3539-59a9-4338-8bab-3badfab6e855@osd.85"
    },
    {
        "style": "cephadm:v1",
        "name": "osd.90",
        "fsid": "9b3b3539-59a9-4338-8bab-3badfab6e855",
        "systemd_unit": "ceph-9b3b3539-59a9-4338-8bab-3badfab6e855@osd.90"
    },
    {
        "style": "cephadm:v1",
        "name": "osd.92",
        "fsid": "9b3b3539-59a9-4338-8bab-3badfab6e855",
        "systemd_unit": "ceph-9b3b3539-59a9-4338-8bab-3badfab6e855@osd.92"
    },
    {
        "style": "cephadm:v1",
        "name": "osd.89",
        "fsid": "9b3b3539-59a9-4338-8bab-3badfab6e855",
        "systemd_unit": "ceph-9b3b3539-59a9-4338-8bab-3badfab6e855@osd.89"
    },
    {
        "style": "cephadm:v1",
        "name": "osd.87",
        "fsid": "9b3b3539-59a9-4338-8bab-3badfab6e855",
        "systemd_unit": "ceph-9b3b3539-59a9-4338-8bab-3badfab6e855@osd.87"
    },
    {
        "style": "cephadm:v1",
        "name": "osd.86",
        "fsid": "9b3b3539-59a9-4338-8bab-3badfab6e855",
        "systemd_unit": "ceph-9b3b3539-59a9-4338-8bab-3badfab6e855@osd.86"
    },
    {
        "style": "cephadm:v1",
        "name": "osd.84",
        "fsid": "9b3b3539-59a9-4338-8bab-3badfab6e855",
        "systemd_unit": "ceph-9b3b3539-59a9-4338-8bab-3badfab6e855@osd.84"
    },
    {
        "style": "cephadm:v1",
        "name": "osd.91",
        "fsid": "9b3b3539-59a9-4338-8bab-3badfab6e855",
        "systemd_unit": "ceph-9b3b3539-59a9-4338-8bab-3badfab6e855@osd.91"
    }
]
```

`lsblk` shows that `/dev/sdg` has been wiped:

```
NAME                                           MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                                              8:0    0 223.6G  0 disk
|-sda1                                           8:1    0    94M  0 part
`-sda2                                           8:2    0 223.5G  0 part
  `-md0                                          9:0    0 223.4G  0 raid1 /
sdb                                              8:16   0 223.6G  0 disk
|-sdb1                                           8:17   0    94M  0 part
`-sdb2                                           8:18   0 223.5G  0 part
  `-md0                                          9:0    0 223.4G  0 raid1 /
sdc                                              8:32   1   3.5T  0 disk
`-ceph--03782b4c--9faa--49f5--b554--98e7b8515834-osd--block--ba272724--daa6--45f5--9f69--789cc0bda077  253:3    0   3.5T  0 lvm
  `-keCkP2-o6h8-jKkw-RKiD-UBFf-A8EL-JDJGPR     253:9    0   3.5T  0 crypt
sdd                                              8:48   1   3.5T  0 disk
`-ceph--c07907d8--4a75--4ba3--b5e1--2ebf49ecbdf6-osd--block--58d1d50d--6228--4e6f--9a52--2a305ba00700  253:7    0   3.5T  0 lvm
  `-WB8Mxn-qCHI-4T01-imiG-hNBR-by60-YuxgfD     253:11   0   3.5T  0 crypt
sde                                              8:64   1   3.5T  0 disk
`-ceph--6f9d4df4--7ce6--44a4--a7b1--62c85af8cfe0-osd--block--aabcb30d--0084--490a--969b--78f7af6e94da  253:8    0   3.5T  0 lvm
  `-g9qErH-vTXY-JQbs-eh61-W0Mn-TAV8-gof4zy     253:12   0   3.5T  0 crypt
sdf                                              8:80   1   3.5T  0 disk
`-ceph--d6b728f8--e365--46db--b30f--6c00805c752b-osd--block--88426db7--2322--4807--ac2e--b49929e170d6  253:6    0   3.5T  0 lvm
  `-LNG2gB-pa0w-gl2v-hVQ3-6qTd-aXsR-Lenri3     253:10   0   3.5T  0 crypt
sdg                                              8:96   1   3.5T  0 disk
sdh                                              8:112  1   3.5T  0 disk
`-ceph--de2cfee6--8e0a--4aa0--9e6b--90c09025768c-osd--block--a3b86251--2799--4243--a857--f218fa90f29a  253:2    0   3.5T  0 lvm
sdi                                              8:128  1   3.5T  0 disk
`-ceph--30dee450--0fdd--46ea--9eec--6a4c7706df9c-osd--block--bfc090db--dde4--47dd--a1c9--1cd838ea43b3  253:4    0   3.5T  0 lvm
sdj                                              8:144  1   3.5T  0 disk
`-ceph--78febcf5--43f4--4820--8dc7--0f6c22816c9f-osd--block--da1e69c7--6427--4562--8290--90bcb9526747  253:0    0   3.5T  0 lvm
sdk                                              8:160  1   3.5T  0 disk
`-ceph--fe210281--b1f5--4d5e--9ab0--2f226612af00-osd--block--6bb9f308--e853--4303--83ea--553c3a3513e1  253:1    0   3.5T  0 lvm
sdl                                              8:176  1   3.5T  0 disk
`-ceph--9f21c916--f211--4d1b--8214--6ad1cecac810-osd--block--572d850c--c201--4af4--ac42--0ed2a6ed73ed  253:5    0   3.5T  0 lvm
```
