[ceph-users] Re: Prometheus anomaly in Reef

Eugen Block Fri, 28 Mar 2025 15:54:01 -0700

There's still some misconfiguration, it appears. It can be confusing, but one thing is the mgr module "prometheus", which provides additional cluster data and runs on default port 9283. The other is the "prometheus server", which collects all the provided cluster data, typically on port 9095.


So these two configs are wrong:


ceph config set mgr mgr/prometheus/server_addr 10.0.1.58
ceph config set mgr mgr/prometheus/server_port 9095

Those should be:

ceph config get mgr mgr/prometheus/server_addr
0.0.0.0
ceph config get mgr mgr/prometheus/server_port
9283

I assume that's why the module is still failing. Can you give that a try and report back?
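
One way to confirm which port the mgr module is actually advertising is to pull it out of the 'ceph mgr services' output. The following is only a sketch under assumptions: the helper names mgr_prometheus_port and reset_mgr_prometheus are made up for illustration, and the parsing assumes the JSON layout shown elsewhere in this thread. Removing the two overrides with 'ceph config rm' lets the module fall back to its own defaults instead of hard-coding them; the disable/enable step is an assumption about how to make the module pick up the change.

```shell
# Hypothetical helper (not part of ceph): read `ceph mgr services` JSON on
# stdin and print the port of the "prometheus" (mgr module) endpoint.
mgr_prometheus_port() {
    sed -E -n 's#.*"prometheus"[[:space:]]*:[[:space:]]*"[^"]*:([0-9]+)/?".*#\1#p'
}

# Hypothetical wrapper: drop the two overrides so the module uses its
# defaults again, then restart the module. Defined here, never called.
reset_mgr_prometheus() {
    ceph config rm mgr mgr/prometheus/server_addr &&
    ceph config rm mgr mgr/prometheus/server_port &&
    ceph mgr module disable prometheus &&
    ceph mgr module enable prometheus
}

# Usage on a cluster node:
#   reset_mgr_prometheus
#   ceph mgr services | mgr_prometheus_port    # the module's default port is 9283
```

If the printed port is 9095, the mgr module is still configured with the port that normally belongs to the prometheus server.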


Quoting Tim Holloway <t...@mousetech.com>:

OK. I didn't realize I'd pasted the wrong orch ls output. Yes, it's "1/1" and has been for an hour. And yes, I did mis-type the port.
The final complaint appears to arise from something looking for prometheus at the failed deployment location dell02. The dashboard says:
The mgr/prometheus module at dell02.mousetech.com:9095 is unreachable.
This could mean that the module has been disabled or the mgr daemon
itself is down. Without the mgr/prometheus module metrics and alerts
will no longer ...

ceph mgr services shows:

# ceph mgr services
{
    "dashboard": "https://10.0.1.52:8443/",
    "prometheus": "http://10.0.1.58:9095/"
}


On 3/28/25 09:55, Eugen Block wrote:
Hi,
Since ceph orch ls wouldn't tell me /where/ the new prometheus was deployed, I used the hosts tab in the dashboard to find it.
ceph orch ps --daemon-type prometheus

would show you where it tried to place the daemon.
So prometheus is now actually up and running? Just to confirm because you pasted the output of 'ceph orch ls' when it wasn't (yet?).
ceph dashboard set-prometheus-api-port 909
Is this a c&p mistake or did you actually miss the 5 here?
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed: gaierror(-2, 'Name or service not known')
This seems to be a DNS issue, both relevant places in the code for "gaierror" point to either
https://github.com/ceph/ceph/blob/be5dba538167f282c4ec74ea3cae958c8bd79830/src/pybind/mgr/cephadm/utils.py#L141 or
https://github.com/ceph/ceph/blob/be5dba538167f282c4ec74ea3cae958c8bd79830/src/python-common/ceph/deployment/utils.py#L58 where one tries a "_dns_lookup" and the other a "resolve_ip". What does 'ceph mgr services' show?
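
Since "Name or service not known" is what the resolver reports for a host name it cannot look up, a quick check is to test the names the cluster uses against the local resolver. This is a minimal sketch, not something taken from the ceph code: the helper name unresolved_hosts is hypothetical, it relies on 'getent hosts', and it only reflects the resolver of the shell it runs in (a containerized mgr may see a different /etc/hosts or resolv.conf).

```shell
# Hypothetical helper: print every given host name that the local resolver
# (as seen through `getent hosts`) cannot resolve; return non-zero if any fail.
unresolved_hosts() {
    rc=0
    for name in "$@"; do
        if ! getent hosts "$name" >/dev/null 2>&1; then
            echo "$name"
            rc=1
        fi
    done
    return $rc
}

# Usage on the host running the active mgr, with names taken from this thread:
#   unresolved_hosts dell02.mousetech.com ceph08.internal.mousetech.com
```

Any name it prints is a candidate for the lookup that makes the module fail.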
Quoting Tim Holloway <t...@mousetech.com>:
OK! Success of a sort.
I removed and re-installed each of the failed services in turn using the "ceph orch rm" command followed by "ceph orch apply". They came up with default settings (1 server), but they did come up.
Finally, I tried it with prometheus. This gave me:
prometheus ?:9095 0/1 - 10s count:1
However, in order for the dashboard to be happy, I had to supply more info. Since ceph orch ls wouldn't tell me /where/ the new prometheus was deployed, I used the hosts tab in the dashboard to find it.
Following that, I had to set the following:

ceph config set mgr mgr/prometheus/server_addr 10.0.1.58

ceph config set mgr mgr/prometheus/server_port 9095

ceph dashboard set-prometheus-api-host 10.0.1.58 (ceph08)

ceph dashboard set-prometheus-api-port 909
Once all of the above were set, the dashboard stopped complaining about being able to access the prometheus API.
However, one last wart remains. Despite being up and running (and confirmed listening on ceph09 port 9095), I do get this:
# ceph health detail
HEALTH_ERR Module 'prometheus' has failed: gaierror(-2, 'Name or service not known'); too many PGs per OSD (648 > max 560)
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed: gaierror(-2, 'Name or service not known')
    Module 'prometheus' has failed: gaierror(-2, 'Name or service not known')
[WRN] TOO_MANY_PGS: too many PGs per OSD (648 > max 560)

On 3/28/25 08:53, Tim Holloway wrote:
Actually, I did deploy a new mds node yesterday. But I followed your instructions and successfully removed and re-installed ceph-exporter (4 nodes). So that part works.
On 3/28/25 07:28, Eugen Block wrote:
Okay, next I would keep prometheus disabled to see if the mgr works properly. So disable the module again, and also reset the dashboard setting to an empty value:
ceph dashboard reset-prometheus-api-host
Then see if you get an mds daemon deployed. Or test it by removing and redeploying ceph-exporter or crash or something, anything to test if the mgr is able to remove and deploy other services.
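
The remove-and-redeploy test described above could be scripted roughly as follows. This is a sketch under assumptions: the function name redeploy_service and the CEPH override variable are hypothetical, 'ceph orch apply' without a placement falls back to the default placement for that service, and removal by the orchestrator is asynchronous, so the apply may need to wait until the old daemons are gone.

```shell
# Hypothetical helper: remove a cephadm-managed service and apply it again
# with default placement. CEPH can be overridden (e.g. CEPH=echo) for a dry run.
redeploy_service() {
    svc=$1
    "${CEPH:-ceph}" orch rm "$svc" &&
    "${CEPH:-ceph}" orch apply "$svc"
}

# Usage:
#   redeploy_service ceph-exporter
#   ceph orch ps --daemon-type ceph-exporter    # shows where the daemons were placed
```

Running it with CEPH=echo first prints the commands without touching the cluster.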
Quoting Tim Holloway <t...@mousetech.com>:
Thanks for the info on removing stubborn dead OSDs. The actual syntax required was:
cephadm rm-daemon --name osd.2 --fsid <fsid> --force
On the "too many pgs", that's because I'm down 2 OSDs. I've got new drives, but they were waiting to clear out the dead stuff. I know it's risky, but I have backups.
Recall that the start of this thread was on a HEALTH_OK system and prometheus was not activating. The OSD stuff was just a distraction.
I did notice that the attempt to add a new mds did work after I did a "ceph mgr fail", so it's only prometheus that's a permanent problem.
Here's the latest health after clearing out the dead OSDs:

# ceph health detail
HEALTH_ERR Module 'prometheus' has failed: gaierror(-2, 'Name or service not known'); too many PGs per OSD (648 > max 560)
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed: gaierror(-2, 'Name or service not known')
    Module 'prometheus' has failed: gaierror(-2, 'Name or service not known')
[WRN] TOO_MANY_PGS: too many PGs per OSD (648 > max 560)
And yes, disabling prometheus will make the "name or service not known" errors go away.
On 3/28/25 02:49, Eugen Block wrote:
Did you disable the prometheus module? I would expect the warning to clear if you did.
Somewhere deep inside ceph, those deleted OSDs still exist. Likely because ceph08 hasn't deleted the systemd units that run them.
Or do you still see those OSDs in 'cephadm ls' output on ceph08? If you do, and if those OSDs are really already drained/purged, you can remove them with 'cephadm rm-daemon --name osd.2'. And I would try to get the MGR into a working state first, before you try to deploy prometheus again. So my recommendation is to get into HEALTH_OK first. And btw, "TOO_MANY_PGS: too many PGs per OSD (648 > max 560)" is serious, you can end up with inactive PGs during recovery, so I'd also consider checking the pools and their PGs.
Quoting Tim Holloway <t...@mousetech.com>:
Thanks for your patience.
host ceph06 isn't referenced in the config database. I think I've finally purged it. I also reset the dashboard API host address from ceph08 to dell02. But since prometheus isn't running on dell02 either, there's no gain there.
I did clear some of that lint out via "ceph mgr fail".
So here's the latest. There are strange things happening at the base OS level that keep host ceph08 from running its OSDs anymore. At boot, device /dev/sdb suddenly changes to /dev/sdd (????) and there seem to be I/O errors. It's really strange, but I'm going to replace the physical drive and that will hopefully cure that.
The problem is, reef and earlier releases seem to have a lot of trouble in deleting OSDs that aren't running. As I've noted before, they tend to get permanently stuck in the "deleting" state. When I cannot restart the OSD, the only cure for that has been to run around the system and apply brute force until things clear up.
I did a dashboard purge of the OSDs on ceph08 and that removed them from the GUI (they'd already drained). I also banged on things until I got them out of the OSD tree display and then did a crush delete on host ceph08. And, incidentally, the OSD tree works on simple host names, not FQDNs like the rest of ceph!
So in theory, I'm ready to jack in new drives and add new OSDs to ceph08. Except:
# ceph health detail
HEALTH_ERR 2 failed cephadm daemon(s); Module 'prometheus' has failed: gaierror(-2, 'Name or service not known'); too many PGs per OSD (648 > max 560)
[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
    daemon osd.2 on ceph08.internal.mousetech.com is in error state
    daemon osd.4 on ceph08.internal.mousetech.com is in error state
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed: gaierror(-2, 'Name or service not known')
    Module 'prometheus' has failed: gaierror(-2, 'Name or service not known')
[WRN] TOO_MANY_PGS: too many PGs per OSD (648 > max 560)
Somewhere deep inside ceph, those deleted OSDs still exist. Likely because ceph08 hasn't deleted the systemd units that run them.
I'm going to try removing/re-installing prometheus, since it's now showing up in ceph health. I think last time I had zombie OSDs I had to brute-force delete their corresponding directories under /var/lib/ceph.
On 3/27/25 14:01, Eugen Block wrote:
ceph config-key rm mgr/cephadm/host.ceph06.internal.mousetech.com.devices.0
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
