Thanks for your patience.
Host ceph06 is no longer referenced in the config database, so I think
I've finally purged it. I also reset the dashboard API host address from
ceph08 to dell02, but since prometheus isn't running on dell02 either,
there's no gain there.
I did clear some of that lint out via "ceph mgr fail".
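For anyone following along, the mgr failover is a one-liner; it bounces
the active mgr over to a standby, which reloads the modules and drops
the stale cached state:

```shell
# Fail over to a standby mgr; the restarted modules re-read
# their state and the leftover "lint" disappears from the output
ceph mgr fail
```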
So here's the latest. There are strange things happening at the base OS
level that keep host ceph08 from running its OSDs anymore. At boot,
device /dev/sdb suddenly changes to /dev/sdd (????) and there seem to be
I/O errors. It's really strange, but I'm going to replace the physical
drive and that will hopefully cure that.
The problem is, Reef and earlier releases seem to have a lot of trouble
deleting OSDs that aren't running. As I've noted before, they tend to
get permanently stuck in the "deleting" state. When I can't restart the
OSD, the only cure has been to run around the system and apply brute
force until things clear up.
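For the record, the "brute force" amounts to something like the sketch
below (osd.4 is just an example ID; the orch removal is usually where it
gets stuck when the daemon is down):

```shell
ceph orch osd rm 4 --force                 # ask cephadm to remove it
ceph orch osd rm status                    # often wedged here in "deleting"
ceph osd purge 4 --yes-i-really-mean-it    # yank it from CRUSH, auth, and the osdmap
```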
I did a dashboard purge of the OSDs on ceph08 and that removed them from
the GUI (they'd already drained). I also banged on things until I got
them out of the OSD tree display and then did a crush delete on host
ceph08. And, incidentally, the OSD tree works on simple host names, not
FQDNs like the rest of ceph!
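The crush delete itself is short; note the bare hostname, since the
CRUSH bucket is "ceph08" rather than the FQDN cephadm uses elsewhere:

```shell
# Remove the now-empty host bucket from the CRUSH map
ceph osd crush rm ceph08
```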
So in theory, I'm ready to jack in new drives and add new OSDs to
ceph08. Except:
# ceph health detail
HEALTH_ERR 2 failed cephadm daemon(s); Module 'prometheus' has failed:
gaierror(-2, 'Name or service not known'); too many PGs per OSD (648 >
max 560)
[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
daemon osd.2 on ceph08.internal.mousetech.com is in error state
daemon osd.4 on ceph08.internal.mousetech.com is in error state
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed: gaierror(-2,
'Name or service not known')
Module 'prometheus' has failed: gaierror(-2, 'Name or service not
known')
[WRN] TOO_MANY_PGS: too many PGs per OSD (648 > max 560)
Somewhere deep inside ceph, those deleted OSDs still exist, likely
because ceph08 hasn't deleted the systemd units that run them.
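If that's the cause, the leftovers should be visible on the host
itself. A sketch (the fsid placeholder has to be filled in with the
actual cluster fsid, and osd.2 is just one of the two failed daemons):

```shell
# On ceph08: look for orphaned cephadm OSD units
systemctl list-units 'ceph-*@osd.*'
# Stop and disable a leftover unit (hypothetical name pattern)
systemctl disable --now ceph-<fsid>@osd.2.service
```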
I'm going to try removing/re-installing prometheus, since it's now
showing up in ceph health. I think the last time I had zombie OSDs, I
had to brute-force delete their corresponding directories under
/var/lib/ceph.
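Before a full re-install it may be worth just cycling the mgr module,
which is what's actually throwing the gaierror; and the zombie OSD
directories live under the fsid-named tree on the host (a sketch, not a
recommendation):

```shell
# Restart the prometheus mgr module rather than redeploying the container
ceph mgr module disable prometheus
ceph mgr module enable prometheus

# On the host: cephadm keeps OSD data dirs per-fsid; these are the
# directories I had to brute-force delete last time
ls -d /var/lib/ceph/*/osd.*
```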
On 3/27/25 14:01, Eugen Block wrote:
ceph config-key rm
mgr/cephadm/host.ceph06.internal.mousetech.com.devices.0
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io