[ceph-users] Re: Prometheus anomaly in Reef

Tim Holloway Fri, 28 Mar 2025 06:59:18 -0700

OK! Success of a sort.

I removed and re-installed each of the failed services in turn using the "ceph orch rm" command followed by "ceph orch apply". They came up with default settings (1 server), but they did come up.


Finally, I tried it with prometheus. This gave me:

prometheus ?:9095 0/1 - 10s count:1

However, in order for the dashboard to be happy, I had to supply more info. Since ceph orch ls wouldn't tell me /where/ the new prometheus was deployed, I used the hosts tab in the dashboard to find it.


Following that, I had to set the following:

ceph config set mgr mgr/prometheus/server_addr 10.0.1.58

ceph config set mgr mgr/prometheus/server_port 9095

ceph dashboard set-prometheus-api-host 10.0.1.58 (ceph08)

ceph dashboard set-prometheus-api-port 909

Once all of the above were set, the dashboard stopped complaining about being unable to access the prometheus API.
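
For anyone who wants to confirm that the endpoint is reachable from the mgr host independently of the dashboard, here is a minimal Python sketch. It only tests whether a TCP connection can be opened; the function name `port_is_open` and the 3-second timeout are my own choices for illustration, not anything Ceph provides.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution
        # failures (socket.gaierror is a subclass of OSError).
        return False
```

Calling `port_is_open("10.0.1.58", 9095)` on the mgr host, with the address and port from the settings above, should return True if something is listening there. It says nothing about whether the listener is actually prometheus.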

However, one last wart remains. Despite being up and running (and confirmed listening on ceph09 port 9095), I do get this:


# ceph health detail

HEALTH_ERR Module 'prometheus' has failed: gaierror(-2, 'Name or service not known'); too many PGs per OSD (648 > max 560)
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed: gaierror(-2, 'Name or service not known')
    Module 'prometheus' has failed: gaierror(-2, 'Name or service not known')

[WRN] TOO_MANY_PGS: too many PGs per OSD (648 > max 560)
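
The `gaierror(-2, 'Name or service not known')` in that output is the exception Python's `socket.getaddrinfo` raises when a host name cannot be resolved, so one way to narrow it down is to try resolving candidate names by hand on the active mgr host. A minimal sketch follows; the helper name `check_names` is mine, and which names the prometheus module actually looks up is not something this thread establishes.

```python
import socket


def check_names(names):
    """Map each name to its resolved addresses, or to the gaierror it raised."""
    results = {}
    for name in names:
        try:
            infos = socket.getaddrinfo(name, None)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the IP address as a string.
            results[name] = sorted({info[4][0] for info in infos})
        except socket.gaierror as exc:
            # Keep the exception so its errno and message can be inspected.
            results[name] = exc
    return results
```

For example, `check_names(["ceph08", "ceph08.internal.mousetech.com", "dell02"])` would show which of the short and fully qualified names seen in this thread resolve on a given host. Any name that maps to a `gaierror` is a candidate for the failure above.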

On 3/28/25 08:53, Tim Holloway wrote:

Actually, I did deploy a new mds node yesterday. But I followed your instructions and successfully removed and re-installed ceph-exporter (4 nodes). So that part works.
On 3/28/25 07:28, Eugen Block wrote:
Okay, next I would keep prometheus disabled to see if the mgr works properly. So disable the module again, and also reset the dashboard setting to an empty value:
ceph dashboard reset-prometheus-api-host
Then see if you get an mds daemon deployed. Or test it by removing and redeploying ceph-exporter or crash or something, anything to test if the mgr is able to remove and deploy other services.
Quoting Tim Holloway <t...@mousetech.com>:
Thanks for the info on removing stubborn dead OSDs. The actual syntax required was:
cephadm rm-daemon --name osd.2 --fsid <fsid> --force
On the "too many pgs", that's because I'm down 2 OSDs. I've got new drives, but they were waiting to clear out the dead stuff. I know it's risky, but I have backups.
Recall that the start of this thread was on a HEALTH_OK system and prometheus was not activating. The OSD stuff was just a distraction.
I did notice that the attempt to add a new mds did work after I did a "ceph mgr fail", so it's only prometheus that's a permanent problem.
Here's the latest health after clearing out the dead OSDs:

# ceph health detail
HEALTH_ERR Module 'prometheus' has failed: gaierror(-2, 'Name or service not known'); too many PGs per OSD (648 > max 560)
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed: gaierror(-2, 'Name or service not known')
    Module 'prometheus' has failed: gaierror(-2, 'Name or service not known')
[WRN] TOO_MANY_PGS: too many PGs per OSD (648 > max 560)
And yes, disabling prometheus will make the "name or service not known" errors go away.
On 3/28/25 02:49, Eugen Block wrote:
Did you disable the prometheus module? I would expect the warning to clear if you did.
Somewhere deep inside ceph, those deleted OSDs still exist. Likely because ceph08 hasn't deleted the systemd units that run them.
Or do you still see those OSDs in 'cephadm ls' output on ceph08? If you do, and if those OSDs are really already drained/purged, you can remove them with 'cephadm rm-daemon --name osd.2'. And I would try to get the MGR into a working state first, before you try to deploy prometheus again. So my recommendation is to get into HEALTH_OK first. And btw, "TOO_MANY_PGS: too many PGs per OSD (648 > max 560)" is serious, you can end up with inactive PGs during recovery, so I'd also consider checking the pools and their PGs.
Quoting Tim Holloway <t...@mousetech.com>:
Thanks for your patience.
host ceph06 isn't referenced in the config database. I think I've finally purged it. I also reset the dashboard API host address from ceph08 to dell02. But since prometheus isn't running on dell02 either, there's no gain there.
I did clear some of that lint out via "ceph mgr fail".
So here's the latest. There are strange things happening at the base OS level that keep host ceph08 from running its OSDs anymore. At boot, device /dev/sdb suddenly changes to /dev/sdd (????) and there seem to be I/O errors. It's really strange, but I'm going to replace the physical drive and that will hopefully cure that.
The problem is, reef and earlier releases seem to have a lot of trouble in deleting OSDs that aren't running. As I've noted before, they tend to get permanently stuck in the "deleting" state. When I cannot restart the OSD, the only cure for that has been to run around the system and apply brute force until things clear up.
I did a dashboard purge of the OSDs on ceph08 and that removed them from the GUI (they'd already drained). I also banged on things until I got them out of the OSD tree display and then did a crush delete on host ceph08. And, incidentally, the OSD tree works on simple host names, not FQDNs like the rest of ceph!
So in theory, I'm ready to jack in new drives and add new OSDs to ceph08. Except:
# ceph health detail
HEALTH_ERR 2 failed cephadm daemon(s); Module 'prometheus' has failed: gaierror(-2, 'Name or service not known'); too many PGs per OSD (648 > max 560)
[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
    daemon osd.2 on ceph08.internal.mousetech.com is in error state
    daemon osd.4 on ceph08.internal.mousetech.com is in error state
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed: gaierror(-2, 'Name or service not known')
    Module 'prometheus' has failed: gaierror(-2, 'Name or service not known')
[WRN] TOO_MANY_PGS: too many PGs per OSD (648 > max 560)
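
When comparing several of these `ceph health detail` dumps, it can help to pull out just the check codes. Below is a small Python sketch that does this for the plain-text form shown above. The regular expression is based only on the lines in this thread (`[ERR]` or `[WRN]`, then a code, then a message), so treat the format as an assumption. `parse_health_checks` is a made-up name.

```python
import re

# Matches lines such as "[WRN] TOO_MANY_PGS: too many PGs per OSD (648 > max 560)".
CHECK_LINE = re.compile(
    r"^\[(?P<severity>[A-Z]+)\]\s+(?P<code>[A-Z0-9_]+):\s*(?P<message>.*)$"
)


def parse_health_checks(text):
    """Return (severity, code, message) tuples found in health detail text."""
    checks = []
    for line in text.splitlines():
        # Indented detail lines do not start with "[...]" and are skipped.
        match = CHECK_LINE.match(line.strip())
        if match:
            checks.append((match["severity"], match["code"], match["message"]))
    return checks


if __name__ == "__main__":
    # Sample lines copied from the health output in this thread.
    sample = """\
[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
    daemon osd.2 on ceph08.internal.mousetech.com is in error state
[WRN] TOO_MANY_PGS: too many PGs per OSD (648 > max 560)
"""
    for severity, code, message in parse_health_checks(sample):
        print(severity, code, message)
```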
Somewhere deep inside ceph, those deleted OSDs still exist. Likely because ceph08 hasn't deleted the systemd units that run them.
I'm going to try removing/re-installing prometheus, since it's now showing up in ceph health. I think last time I had zombie OSDs I had to brute-force delete their corresponding directories under /var/lib/ceph.
On 3/27/25 14:01, Eugen Block wrote:
ceph config-key rm mgr/cephadm/host.ceph06.internal.mousetech.com.devices.0
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io