For now I set the service to "unmanaged" to prevent further log flooding. But I would still like to know why the cache is not updated properly.

Quoting Eugen Block <ebl...@nde.ag>:

Good morning,

I noticed something strange on an 18.2.7 cluster running on Ubuntu 22.04, deployed by cephadm. There are 10 hosts in total; 5 of them are all-flash, and those aren't affected. The other 5 hosts are hdd-only, and only 4 of those are affected:

The /var/log/ceph/{FSID}/ceph-volume.log is flooded with attempts to apply the osd spec:


[2025-07-17 05:40:01,994][ceph_volume.main][INFO ] Running command: ceph-volume lvm batch --no-auto /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds /dev/sdt /dev/sdu /dev/sdv /dev/sdw --yes --no-systemd
[2025-07-17 05:42:00,216][ceph_volume.main][INFO ] Running command: ceph-volume lvm batch --no-auto /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds /dev/sdt /dev/sdu /dev/sdv /dev/sdw --yes --no-systemd
[2025-07-17 05:43:50,521][ceph_volume.main][INFO ] Running command: ceph-volume lvm batch --no-auto /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds /dev/sdt /dev/sdu /dev/sdv /dev/sdw --yes --no-systemd


So that log file alone grows to more than 1 GB per day; as a consequence, other logs like syslog grow as well.
But for some reason, storage08 is skipped, the mgr reports:

[cephadm DEBUG root] skipping apply of storage08 on DriveGroupSpec.from_json(yaml.safe_load('''service_type: osd

This is the current hdd-only spec:

# ceph orch ls osd --export
service_type: osd
service_id: hdd-only
service_name: osd.hdd-only
placement:
  hosts:
  - storage06
  - storage07
  - storage08
  - storage09
  - storage10
spec:
  data_devices:
    rotational: 1
    size: '1T:'
  filter_logic: AND
  objectstore: bluestore

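In other words, a device has to match all of the filters at once before it is picked as a data device. A minimal Python sketch of that AND semantics as I understand it (the function and field names here are illustrative, not cephadm's actual internals):

```python
# Hypothetical sketch of a drive-group filter with filter_logic: AND.
# A device qualifies only if ALL filters match: rotational == 1 (HDD)
# and size >= 1T (the open-ended '1T:' range from the spec above).
def matches_spec(device: dict, min_size_tb: float = 1.0) -> bool:
    return device["rotational"] == 1 and device["size_tb"] >= min_size_tb

devices = [
    {"path": "/dev/sdb", "rotational": 1, "size_tb": 16.4},  # HDD, matches
    {"path": "/dev/sda", "rotational": 0, "size_tb": 1.9},   # SSD, rejected
    {"path": "/dev/sdz", "rotational": 1, "size_tb": 0.5},   # HDD but too small
]

selected = [d["path"] for d in devices if matches_spec(d)]
print(selected)  # ['/dev/sdb']
```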

I verified that all disks are deployed as OSDs, so there are no orphan devices lying around or anything. I failed the mgr (of course) and rebooted one host, since storage08 had recently been rebooted as well. Unfortunately, I don't know how long this has been going on.

So I started to look at the code [0], [1]; it seems like the mgr cache is not being updated properly:

if not self.mgr.cache.osdspec_needs_apply(host, drive_group):
    self.mgr.log.debug("skipping apply of %s on %s (no change)" % (
        host, drive_group))


So I looked at all these values:

    def osdspec_needs_apply(self, host: str, spec: ServiceSpec) -> bool:
        if (
            host not in self.devices
            or host not in self.last_device_change
            or host not in self.last_device_update
            or host not in self.osdspec_last_applied
            or spec.service_name() not in self.osdspec_last_applied[host]
        ):

but all keys are populated with values similar to those on storage08, for example:


root@storage01:~# ceph config-key get mgr/cephadm/host.storage10 | jq -r '.last_device_change,.last_device_update,.osdspec_last_applied'
2025-02-12T16:27:21.979015Z
2025-07-17T05:35:18.852618Z
{
  "osd.hdd-only": "2025-07-17T06:03:38.860971Z"
}

root@storage01:~# ceph config-key get mgr/cephadm/host.storage08 | jq -r '.last_device_change,.last_device_update,.osdspec_last_applied'
2025-03-11T08:23:02.851969Z
2025-07-17T05:43:47.521682Z
{
  "osd.hdd-only": "2025-03-11T08:23:21.494004Z"
}
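To compare the cached timestamps side by side, here is a small sketch that parses the values above (the dict keys mirror the cache fields shown by `ceph config-key get`; the script itself is just my own comparison, not ceph code):

```python
from datetime import datetime

# Timestamps copied verbatim from the two config-key dumps above.
FMT = "%Y-%m-%dT%H:%M:%S.%fZ"

hosts = {
    "storage10": {
        "last_device_change": "2025-02-12T16:27:21.979015Z",
        "osdspec_last_applied": "2025-07-17T06:03:38.860971Z",
    },
    "storage08": {
        "last_device_change": "2025-03-11T08:23:02.851969Z",
        "osdspec_last_applied": "2025-03-11T08:23:21.494004Z",
    },
}

for host, ts in hosts.items():
    change = datetime.strptime(ts["last_device_change"], FMT)
    applied = datetime.strptime(ts["osdspec_last_applied"], FMT)
    print(host, "applied after last device change:", applied > change)
```

On both hosts the spec was last applied *after* the last device change, so from these values alone I can't see what distinguishes storage08 (skipped) from storage10 (re-applied constantly).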


Can anyone make sense of it? I'd appreciate any pointers!

Thanks!
Eugen

[0] https://github.com/ceph/ceph/blob/v18.2.7/src/pybind/mgr/cephadm/services/osd.py#L42
[1] https://github.com/ceph/ceph/blob/v18.2.7/src/pybind/mgr/cephadm/inventory.py#L1316


_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io