I enabled debug logging with `ceph config set mgr 
mgr/cephadm/log_to_cluster_level debug` and watched the logs with `ceph -W 
cephadm --watch-debug`. I can see the orchestrator refreshing the device list, 
and this is reflected in the `ceph-volume.log` file on the target OSD nodes. 
When I restart the mgr, `ceph orch device ls` reports each device with “5w ago” 
in the “REFRESHED” column, but after the orchestrator next attempts to refresh 
the device list, `ceph orch device ls` stops outputting any data at all until I 
restart the mgr again.

I discovered that I can query the cached device data using `ceph config-key 
dump`. On the problematic cluster, the `created` attribute is stale, e.g.

ceph config-key dump | jq -r '."mgr/cephadm/host.ceph-osd31.devices.0"' | jq '.devices[].created'
"2024-09-23T17:56:44.914535Z"
"2024-09-23T17:56:44.914569Z"
"2024-09-23T17:56:44.914591Z"
"2024-09-23T17:56:44.914612Z"
"2024-09-23T17:56:44.914632Z"
"2024-09-23T17:56:44.914652Z"
"2024-09-23T17:56:44.914672Z"
"2024-09-23T17:56:44.914692Z"
"2024-09-23T17:56:44.914711Z"
"2024-09-23T17:56:44.914732Z"

whereas on working clusters the `created` attribute is set to the time the 
device information was last cached, e.g.

ceph config-key dump | jq -r '."mgr/cephadm/host.ceph-osd1.devices.0"' | jq '.devices[].created'
"2024-10-28T21:49:29.510593Z"
"2024-10-28T21:49:29.510635Z"
"2024-10-28T21:49:29.510657Z"
"2024-10-28T21:49:29.510678Z"
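In case it's useful to anyone else chasing this, here's a small Python sketch that flags hosts whose cached `created` timestamps are older than a threshold. It only assumes the structure visible in the `ceph config-key dump` output above (`mgr/cephadm/host.<host>.devices.<n>` keys whose values are JSON with a `devices[].created` list); the sample data and the 7-day threshold are just illustrative, and the host-name extraction assumes hostnames without dots:

```python
import json
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=7)  # illustrative threshold

def stale_hosts(config_key_dump: str, now=None):
    """Given the JSON text from `ceph config-key dump`, return a dict of
    hosts whose cached device entries are older than STALE_AFTER,
    mapped to the age of their oldest entry."""
    now = now or datetime.now(timezone.utc)
    stale = {}
    for key, value in json.loads(config_key_dump).items():
        # Only look at the per-host cached device entries.
        if not key.startswith("mgr/cephadm/host.") or ".devices." not in key:
            continue
        host = key.split(".")[1]  # mgr/cephadm/host.<host>.devices.<n>
        for dev in json.loads(value).get("devices", []):
            created = datetime.fromisoformat(
                dev["created"].replace("Z", "+00:00"))
            age = now - created
            if age > STALE_AFTER:
                stale[host] = max(stale.get(host, timedelta(0)), age)
    return stale

# Hypothetical sample mimicking the two dumps shown above:
sample = json.dumps({
    "mgr/cephadm/host.ceph-osd31.devices.0": json.dumps(
        {"devices": [{"created": "2024-09-23T17:56:44.914535Z"}]}),
    "mgr/cephadm/host.ceph-osd1.devices.0": json.dumps(
        {"devices": [{"created": "2024-10-28T21:49:29.510593Z"}]}),
})
now = datetime(2024, 10, 29, tzinfo=timezone.utc)
print(stale_hosts(sample, now))  # only ceph-osd31 is flagged as stale
```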

It appears that the orchestrator is polling the devices but failing to update 
the cache for some reason. It would be interesting to see what happens if I 
removed one of these device entries from the cache, but the cluster is in 
production so I’m hesitant to poke at it.

We have a maintenance window scheduled in December which will provide an 
opportunity to perform a complete restart of the cluster. Hopefully that will 
clean things up. In the meantime, I’ve set all devices to be unmanaged, and the 
cluster is otherwise healthy, so unless anyone has any other ideas to offer I 
guess I’ll just leave things as-is until the maintenance window.

Cheers,
/rjg

On Oct 25, 2024, at 10:31 AM, Bob Gibson <r...@oicr.on.ca> wrote:

[…]
My hunch is that some persistent state is corrupted, or there’s something else 
preventing the orchestrator from successfully refreshing its device status, but 
I don’t know how to troubleshoot this. Any ideas?

I don't think this is related to the 'osd' service. As suggested by Tobi, 
enabling cephadm debug will tell you more.

Agreed. I’ll dig through the logs some more today to see if I can spot any 
problems.

Cheers,
/rjg

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
