For now I have set the service to "unmanaged" to prevent further log
flooding, but I would still like to know why the cache is not updated
properly.
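For reference, the way I did that, roughly: re-apply the spec with
unmanaged set, via `ceph orch apply -i <file>` (an excerpt, the rest of
the spec stays as exported below):

```yaml
# Excerpt of the OSD spec with the service marked unmanaged.
# Applied with: ceph orch apply -i osd-hdd-only.yaml
# (file name is just an example; remaining spec fields unchanged)
service_type: osd
service_id: hdd-only
unmanaged: true
```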
Quoting Eugen Block <ebl...@nde.ag>:
Good morning,
I noticed something strange on an 18.2.7 cluster running on Ubuntu
22.04, deployed by cephadm. There are 10 hosts in total; 5 of them
are all-flash and those aren't affected. The other 5 hosts are
hdd-only, and only 4 of those are affected:
The /var/log/ceph/{FSID}/ceph-volume.log is flooded with attempts to
apply the osd spec:
[2025-07-17 05:40:01,994][ceph_volume.main][INFO ] Running command:
ceph-volume lvm batch --no-auto /dev/sdb /dev/sdc /dev/sdd /dev/sde
/dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl
/dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds
/dev/sdt /dev/sdu /dev/sdv /dev/sdw --yes --no-systemd
[2025-07-17 05:42:00,216][ceph_volume.main][INFO ] Running command:
ceph-volume lvm batch --no-auto /dev/sdb /dev/sdc /dev/sdd /dev/sde
/dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl
/dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds
/dev/sdt /dev/sdu /dev/sdv /dev/sdw --yes --no-systemd
[2025-07-17 05:43:50,521][ceph_volume.main][INFO ] Running command:
ceph-volume lvm batch --no-auto /dev/sdb /dev/sdc /dev/sdd /dev/sde
/dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl
/dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds
/dev/sdt /dev/sdu /dev/sdv /dev/sdw --yes --no-systemd
That log file alone grows by more than 1 GB per day, and as a
consequence other logs such as syslog grow as well.
But for some reason, storage08 is skipped; the mgr reports:
[cephadm DEBUG root] skipping apply of storage08 on
DriveGroupSpec.from_json(yaml.safe_load('''service_type: osd
This is the current hdd-only spec:
# ceph orch ls osd --export
service_type: osd
service_id: hdd-only
service_name: osd.hdd-only
placement:
  hosts:
  - storage06
  - storage07
  - storage08
  - storage09
  - storage10
spec:
  data_devices:
    rotational: 1
    size: '1T:'
  filter_logic: AND
  objectstore: bluestore
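For completeness, filter_logic: AND means a device must match all
filters in data_devices; a standalone sketch of that selection
(illustrative only, not ceph-volume's actual code):

```python
# Toy illustration of an AND filter_logic drive group: a device
# qualifies as a data device only if it matches *every* filter.
# Not cephadm's real implementation.

def matches_spec(device: dict) -> bool:
    one_tib = 1024 ** 4
    return (
        device["rotational"] == 1        # rotational: 1
        and device["size"] >= one_tib    # size: '1T:' -> 1 TiB or larger
    )

devices = [
    {"path": "/dev/sdb", "rotational": 1, "size": 18 * 1024 ** 4},     # HDD
    {"path": "/dev/nvme0n1", "rotational": 0, "size": 2 * 1024 ** 4},  # flash
]
selected = [d["path"] for d in devices if matches_spec(d)]
print(selected)  # ['/dev/sdb']
```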
I verified that all disks are deployed as OSDs, so there are no orphan
devices lying around or anything. I failed the mgr (of course) and
rebooted one host, since storage08 had recently been rebooted as well.
Unfortunately, I don't know how long this has been going on.
So I started looking at the code [0], [1]; it seems like the mgr
cache is not updated properly:
if not self.mgr.cache.osdspec_needs_apply(host, drive_group):
    self.mgr.log.debug("skipping apply of %s on %s (no change)" % (
        host, drive_group))
So I looked at all these values:
def osdspec_needs_apply(self, host: str, spec: ServiceSpec) -> bool:
    if (
        host not in self.devices
        or host not in self.last_device_change
        or host not in self.last_device_update
        or host not in self.osdspec_last_applied
        or spec.service_name() not in self.osdspec_last_applied[host]
    ):
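As I read it, that early exit means a missing cache entry forces a
re-apply on every refresh; a toy model of the membership checks
(simplified and standalone, not the real HostCache class):

```python
# Toy model of the early-exit in osdspec_needs_apply: if any per-host
# cache entry is missing, cephadm cannot tell whether the spec is up
# to date and assumes it needs applying. Simplified illustration only.

def osdspec_needs_apply(cache: dict, host: str, service_name: str) -> bool:
    if (
        host not in cache.get("devices", {})
        or host not in cache.get("last_device_change", {})
        or host not in cache.get("last_device_update", {})
        or host not in cache.get("osdspec_last_applied", {})
        or service_name not in cache.get("osdspec_last_applied", {}).get(host, {})
    ):
        return True  # missing state -> assume the spec needs applying
    return False     # (the real method goes on to compare timestamps)

cache = {
    "devices": {"storage08": []},
    "last_device_change": {"storage08": "2025-03-11T08:23:02Z"},
    "last_device_update": {"storage08": "2025-07-17T05:43:47Z"},
    "osdspec_last_applied": {"storage08": {}},  # no entry for the spec yet
}
print(osdspec_needs_apply(cache, "storage08", "osd.hdd-only"))  # True
```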
but all keys are populated, with values similar to those on storage08, for example:
root@storage01:~# ceph config-key get mgr/cephadm/host.storage10 |
jq -r '.last_device_change,.last_device_update,.osdspec_last_applied'
2025-02-12T16:27:21.979015Z
2025-07-17T05:35:18.852618Z
{
"osd.hdd-only": "2025-07-17T06:03:38.860971Z"
}
root@storage01:~# ceph config-key get mgr/cephadm/host.storage08 |
jq -r '.last_device_change,.last_device_update,.osdspec_last_applied'
2025-03-11T08:23:02.851969Z
2025-07-17T05:43:47.521682Z
{
"osd.hdd-only": "2025-03-11T08:23:21.494004Z"
}
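To make the comparison concrete: if the decision ultimately reduces to
last_device_change being newer than osdspec_last_applied (my assumption
from reading [1], not verified), then both hosts should come out as
"no change":

```python
from datetime import datetime

def parse(ts: str) -> datetime:
    # Timestamp format as stored in the mgr/cephadm/host.<name> config keys
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")

# (last_device_change, osdspec_last_applied["osd.hdd-only"]) per host,
# copied from the config-key output above
hosts = {
    "storage10": ("2025-02-12T16:27:21.979015Z", "2025-07-17T06:03:38.860971Z"),
    "storage08": ("2025-03-11T08:23:02.851969Z", "2025-03-11T08:23:21.494004Z"),
}

needs_apply = {h: parse(changed) > parse(applied)
               for h, (changed, applied) in hosts.items()}
print(needs_apply)  # {'storage10': False, 'storage08': False}
```

Yet only storage08 is actually skipped, while the other hdd-only hosts
keep re-applying.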
Can anyone make sense of it? I'd appreciate any pointers!
Thanks!
Eugen
[0]
https://github.com/ceph/ceph/blob/v18.2.7/src/pybind/mgr/cephadm/services/osd.py#L42
[1]
https://github.com/ceph/ceph/blob/v18.2.7/src/pybind/mgr/cephadm/inventory.py#L1316
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io