[ceph-users] Re: Prometheus anomaly in Reef

Eugen Block Fri, 28 Mar 2025 00:26:30 -0700

Did you disable the prometheus module? I would expect the warning to clear if you did.

Somewhere deep inside ceph, those deleted OSDs still exist. Likely because ceph08 hasn't deleted the systemd units that run them.

Or do you still see those OSDs in 'cephadm ls' output on ceph08? If you do, and if those OSDs are really already drained/purged, you can remove them with 'cephadm rm-daemon --name osd.2'. And I would try to get the MGR into a working state first, before you try to deploy prometheus again. So my recommendation is to get into HEALTH_OK first. And btw, "TOO_MANY_PGS: too many PGs per OSD (648 > max 560)" is serious, you can end up with inactive PGs during recovery, so I'd also consider checking the pools and their PGs.
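The 'cephadm ls' check above could be scripted roughly as follows. This is only a sketch: it assumes 'cephadm ls' prints a JSON list of objects with a "name" field such as "osd.2", and the sample data and the helper name stale_osd_daemons are made up for illustration.

```python
import json


def stale_osd_daemons(cephadm_ls_json, purged_ids):
    """Return names of OSD daemons that cephadm still lists on a host
    even though their IDs were already purged from the cluster.

    Assumes the input is a JSON list of objects with a "name" key.
    """
    purged = {f"osd.{osd_id}" for osd_id in purged_ids}
    daemons = json.loads(cephadm_ls_json)
    return sorted(d["name"] for d in daemons if d.get("name") in purged)


# Hypothetical, trimmed sample shaped like 'cephadm ls' output (not real data).
sample = """[
  {"style": "cephadm:v1", "name": "osd.2", "state": "error"},
  {"style": "cephadm:v1", "name": "osd.4", "state": "error"},
  {"style": "cephadm:v1", "name": "osd.7", "state": "running"},
  {"style": "cephadm:v1", "name": "mgr.ceph08.abcdef", "state": "running"}
]"""

for name in stale_osd_daemons(sample, [2, 4]):
    print(name)
```

Each name it prints is a candidate for the 'cephadm rm-daemon' command mentioned above, but only after confirming that the OSD really has been drained and purged.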


Zitat von Tim Holloway <t...@mousetech.com>:

Thanks for your patience.
Host ceph06 isn't referenced in the config database. I think I've finally purged it. I also reset the dashboard API host address from ceph08 to dell02. But since prometheus isn't running on dell02 either, there's no gain there.
I did clear some of that lint out via "ceph mgr fail".
So here's the latest. There are strange things happening at the base OS level that keep host ceph08 from running its OSDs anymore. At boot, device /dev/sdb suddenly changes to /dev/sdd (????) and there seem to be I/O errors. It's really strange, but I'm going to replace the physical drive and that will hopefully cure that.
The problem is, reef and earlier releases seem to have a lot of trouble in deleting OSDs that aren't running. As I've noted before, they tend to get permanently stuck in the "deleting" state. When I cannot restart the OSD, the only cure for that has been to run around the system and apply brute force until things clear up.
I did a dashboard purge of the OSDs on ceph08 and that removed them from the GUI (they'd already drained). I also banged on things until I got them out of the OSD tree display and then did a crush delete on host ceph08. And, incidentally, the OSD tree works on simple hostnames, not FQDNs like the rest of ceph!
So in theory, I'm ready to jack in new drives and add new OSDs to ceph08. Except:
# ceph health detail
HEALTH_ERR 2 failed cephadm daemon(s); Module 'prometheus' has failed: gaierror(-2, 'Name or service not known'); too many PGs per OSD (648 > max 560)
[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
    daemon osd.2 on ceph08.internal.mousetech.com is in error state
    daemon osd.4 on ceph08.internal.mousetech.com is in error state
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed: gaierror(-2, 'Name or service not known')
    Module 'prometheus' has failed: gaierror(-2, 'Name or service not known')
[WRN] TOO_MANY_PGS: too many PGs per OSD (648 > max 560)
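As a rough illustration of why the TOO_MANY_PGS ratio goes up when OSDs are removed, the arithmetic can be sketched like this. It assumes the ratio is the total number of PG replicas (pg_num times replica size, summed over all pools) divided by the number of OSDs that are in. The pool figures are hypothetical and are not taken from this cluster.

```python
def pg_replicas_per_osd(pools, num_osds):
    """Approximate PGs per OSD as total PG replicas divided by OSD count.

    pools is a list of (pg_num, replica_size) tuples. This mirrors the
    assumed calculation described above, not code taken from Ceph.
    """
    if num_osds <= 0:
        raise ValueError("num_osds must be positive")
    total_replicas = sum(pg_num * size for pg_num, size in pools)
    return total_replicas / num_osds


# Hypothetical pools: (pg_num, replica size). Illustrative values only.
pools = [(256, 3), (128, 3), (32, 3)]

print(pg_replicas_per_osd(pools, 3))  # same pools spread over three OSDs
print(pg_replicas_per_osd(pools, 2))  # after one OSD has been removed
```

With the same pools, fewer OSDs means every remaining OSD carries more PG replicas, so purging OSDs without adding replacements or reducing pg_num pushes the ratio toward the limit.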
Somewhere deep inside ceph, those deleted OSDs still exist. Likely because ceph08 hasn't deleted the systemd units that run them.
I'm going to try removing/re-installing prometheus, since it's now showing up in ceph health. I think last time I had zombie OSDs I had to brute-force delete their corresponding directories under /var/lib/ceph.
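The gaierror(-2, 'Name or service not known') in the health output is Python's socket.gaierror, raised when a hostname cannot be resolved (on Linux with glibc, -2 is EAI_NONAME). The thread does not say which name the prometheus module fails to look up, so the following is just a sketch for testing candidate hostnames from the node running the active MGR. The helper name check_resolves is made up, and port 9283 is assumed to be the prometheus module's port.

```python
import socket


def check_resolves(host, port=9283):
    """Try to resolve host via Python's socket layer.

    Returns (True, [addresses]) on success, or (False, description)
    when getaddrinfo raises socket.gaierror.
    """
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        return False, f"gaierror({exc.errno}, {exc.strerror!r})"
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    return True, sorted({info[4][0] for info in infos})


# An IP literal always resolves; names under .invalid are reserved and
# should never resolve, so they demonstrate the failure path.
for candidate in ("127.0.0.1", "no-such-host.invalid"):
    print(candidate, check_resolves(candidate))
```

On the cluster itself, the candidates would be the hostnames ceph has recorded, for example ceph08.internal.mousetech.com or a leftover entry for a removed host. Any name that fails here is a likely source of the module error.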
On 3/27/25 14:01, Eugen Block wrote:
ceph config-key rm mgr/cephadm/host.ceph06.internal.mousetech.com.devices.0
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



