It's rather dense. I get two very long lines, which come from
mgr/cephadm/host.dell02.mousetech.com and
mgr/cephadm/host.ceph08.mousetech.com
The ceph08 entry references the prometheus on ceph08 and the mgr and
ceph-exporter on dell02.
The dell02 entry references the container image for the prometheus node.
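If it helps, I can dump either entry in full and pretty-print it with
something along the lines of
ceph config-key get mgr/cephadm/host.dell02.mousetech.com | python3 -m json.tool
(assuming the stored value is JSON, which it appears to be) and grep
that for prometheus.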
On 3/29/25 05:13, Eugen Block wrote:
How about this:
ceph config-key dump | grep -v history
Can you spot any key regarding dell02 that doesn't belong there?
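If the dump is still too noisy, narrowing it with another grep should
work, e.g.:
ceph config-key dump | grep -v history | grep -i dell02
That should catch both the dell02 keys themselves and any other
entries that mention dell02.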
Zitat von Tim Holloway <t...@mousetech.com>:
Only the stuff that defines the rgw daemon on dell02.
On 3/28/25 19:23, Eugen Block wrote:
Do you find anything related to dell02 in config dump?
ceph config dump | grep -C2 dell02
Zitat von Tim Holloway <t...@mousetech.com>:
I'm guessing that the configuration issues come from the dashboard
wanting the prometheus API at 9095, versus prometheus itself on 9283.
Regardless, that didn't fix the message. As far as I can tell, the
"service not known" is coming from something trying to contact
prometheus on host dell02, and dell02 isn't running prometheus,
since the way I got it working at all was to do a generic "ceph
orch apply" without arguments. The problem is that the YAML config
is apparently still lurking around in the background even though it
never spawned the requested instances.
What I'll probably do is remove prometheus, try an "orch apply"
with the placement given on the command line (deploy to the two hosts
ceph02 and dell02), see what works or breaks, and if that succeeds,
try again with the YAML. That won't totally prove everything's fixed,
since it could be drawing on the hidden stuff that won't go away, but
at least it would make it superficially clean.
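Concretely, the command-line version would be something like
ceph orch rm prometheus
ceph orch apply prometheus --placement="2 ceph02 dell02"
(count and hosts per the plan above), and only if that comes up
cleanly, re-apply the YAML spec.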
On 3/28/25 18:52, Eugen Block wrote:
There's still some misconfiguration, it appears. It can be
confusing, but one thing is the mgr module "prometheus", which
provides additional cluster data and runs on default port 9283.
The other is the "prometheus server", which collects all the
provided cluster data, typically on port 9095.
So these two configs are wrong:
ceph config set mgr mgr/prometheus/server_addr 10.0.1.58
ceph config set mgr mgr/prometheus/server_port 9095
Instead, they should be back at the defaults:
ceph config get mgr mgr/prometheus/server_addr
0.0.0.0
ceph config get mgr mgr/prometheus/server_port
9283
I assume that's why the module is still failing. Can you give that
a try and report back?
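If it's easier, the quickest way back to the defaults is probably to
just drop the overrides and bounce the module, e.g.:
ceph config rm mgr mgr/prometheus/server_addr
ceph config rm mgr mgr/prometheus/server_port
ceph mgr module disable prometheus
ceph mgr module enable prometheus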
Zitat von Tim Holloway <t...@mousetech.com>:
OK. I didn't realize I'd pasted the wrong orch ls output. Yes,
it's "1/1" and has been for an hour. And yes, I did mis-type the
port.
The final complaint appears to arise from something looking for
prometheus at the failed deployment location dell02. The
dashboard says:
The mgr/prometheus module at dell02.mousetech.com:9095 is
unreachable.
This could mean that the module has been disabled or the mgr daemon
itself is down. Without the mgr/prometheus module metrics and alerts
will no longer ...
ceph mgr services shows:
# ceph mgr services
{
"dashboard":"https://10.0.1.52:8443/",
"prometheus":"http://10.0.1.58:9095/"
}
On 3/28/25 09:55, Eugen Block wrote:
Hi,
Since ceph orch ls wouldn't tell me /where/ the new prometheus
was deployed, I used the hosts tab in the dashboard to find it.
ceph orch ps --daemon-type prometheus
would show you where it tried to place the daemon.
So prometheus is now actually up and running? Just to confirm
because you pasted the output of 'ceph orch ls' when it wasn't
(yet?).
ceph dashboard set-prometheus-api-port 909
Is this a c&p mistake or did you actually miss the 5 here?
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed:
gaierror(-2, 'Name or service not known')
This seems to be a DNS issue, both relevant places in the code
for "gaierror" point to either
https://github.com/ceph/ceph/blob/be5dba538167f282c4ec74ea3cae958c8bd79830/src/pybind/mgr/cephadm/utils.py#L141
or
https://github.com/ceph/ceph/blob/be5dba538167f282c4ec74ea3cae958c8bd79830/src/python-common/ceph/deployment/utils.py#L58
where one tries a "_dns_lookup" and the other a "resolve_ip".
What does 'ceph mgr services' show?
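A quick way to reproduce what the mgr is trying would be to run the
lookup by hand on the active mgr host, e.g.:
getent hosts dell02.mousetech.com
If that returns nothing, the module would hit the same gaierror.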
Zitat von Tim Holloway <t...@mousetech.com>:
OK! Success of a sort.
I removed and re-installed each of the failed services in turn
using the "ceph orch rm" command followed by "ceph orch apply".
They came up with default settings (1 server), but they did
come up.
Finally, I tried it with prometheus. This gave me:
prometheus ?:9095 0/1 - 10s
count:1
However, in order for the dashboard to be happy, I had to
supply more info. Since ceph orch ls wouldn't tell me /where/
the new prometheus was deployed, I used the hosts tab in the
dashboard to find it.
Following that, I had to set the following:
ceph config set mgr mgr/prometheus/server_addr 10.0.1.58
ceph config set mgr mgr/prometheus/server_port 9095
ceph dashboard set-prometheus-api-host 10.0.1.58 (ceph08)
ceph dashboard set-prometheus-api-port 909
Once all of the above were set, the dashboard stopped
complaining about not being able to access the prometheus API.
However, one last wart remains. Despite prometheus being up and
running (and confirmed listening on ceph09 port 9095), I still get this:
# ceph health detail
HEALTH_ERR Module 'prometheus' has failed: gaierror(-2, 'Name
or service not known'); too many PGs per OSD (648 > max 560)
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed:
gaierror(-2, 'Name or service not known')
Module 'prometheus' has failed: gaierror(-2, 'Name or
service not known')
[WRN] TOO_MANY_PGS: too many PGs per OSD (648 > max 560)
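I suppose the next thing to check is whether the endpoint actually
answers from the mgr host, something like:
curl -s http://10.0.1.58:9095/metrics | head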
On 3/28/25 08:53, Tim Holloway wrote:
Actually, I did deploy a new mds node yesterday. But I
followed your instructions and successfully removed and
re-installed ceph-exporter (4 nodes). So that part works.
On 3/28/25 07:28, Eugen Block wrote:
Okay, next I would keep prometheus disabled to see if the mgr
works properly. So disable the module again, and also reset
the dashboard setting to an empty value:
ceph dashboard reset-prometheus-api-host
Then see if you get an mds daemon deployed. Or test it by
removing and redeploying ceph-exporter or crash or something,
anything to test if the mgr is able to remove and deploy
other services.
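For example (if I have the service name right), something like
ceph mgr module disable prometheus
ceph orch redeploy ceph-exporter
or a full 'ceph orch rm ceph-exporter' followed by 'ceph orch apply
ceph-exporter' would be enough to show whether the orchestrator is
doing its job.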
Zitat von Tim Holloway <t...@mousetech.com>:
Thanks for the info on removing stubborn dead OSDs. The
actual syntax required was:
cephadm rm-daemon --name osd.2 --fsid <fsid> --force
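(For anyone following along, the fsid itself comes from 'ceph fsid'.)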
On the "too many pgs", that's because I'm down 2 OSDs. I've
got new drives, but they were waiting to clear out the dead
stuff. I know it's risky, but I have backups.
Recall that the start of this thread was on a HEALTH_OK
system and prometheus was not activating. The OSD stuff was
just a distraction.
I did notice that the attempt to add a new mds did work
after I did a "ceph mgr fail", so it's only prometheus
that's a permanent problem.
Here's the latest health after clearing out the dead OSDs:
# ceph health detail
HEALTH_ERR Module 'prometheus' has failed: gaierror(-2,
'Name or service not known'); too many PGs per OSD (648 >
max 560)
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed:
gaierror(-2, 'Name or service not known')
Module 'prometheus' has failed: gaierror(-2, 'Name or
service not known')
[WRN] TOO_MANY_PGS: too many PGs per OSD (648 > max 560)
And yes, disabling prometheus will make the "name or service
not known" errors go away.
On 3/28/25 02:49, Eugen Block wrote:
Did you disable the prometheus module? I would expect the
warning to clear if you did.
Somewhere deep inside ceph, those deleted OSDs still
exist. Likely because ceph08 hasn't deleted the systemd
units that run them.
Or do you still see those OSDs in 'cephadm ls' output on
ceph08? If you do, and if those OSDs are really already
drained/purged, you can remove them with 'cephadm rm-daemon
--name osd.2'. And I would try to get the MGR into a
working state first, before you try to deploy prometheus
again. So my recommendation is to get into HEALTH_OK first.
And btw, "TOO_MANY_PGS: too many PGs per OSD (648 > max
560)" is serious: you can end up with inactive PGs during
recovery, so I'd also consider checking the pools and their
PGs.
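A starting point could be
ceph osd pool ls detail
ceph osd pool autoscale-status
to see which pools account for most of the PGs.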
Zitat von Tim Holloway <t...@mousetech.com>:
Thanks for your patience.
host ceph06 isn't referenced in the config database. I
think I've finally purged it. I also reset the dashboard
API host address from ceph08 to dell02. But since
prometheus isn't running on dell02 either, there's no gain
there.
I did clear some of that lint out via "ceph mgr fail".
So here's the latest. There are strange things happening
at the base OS level that keep host ceph08 from running
its OSDs anymore. At boot, device /dev/sdb suddenly
changes to /dev/sdd (????) and there seem to be I/O
errors. It's really strange, but I'm going to replace the
physical drive and that will hopefully cure that.
The problem is, reef and earlier releases seem to have a
lot of trouble in deleting OSDs that aren't running. As
I've noted before, they tend to get permanently stuck in
the "deleting" state. When I cannot restart the OSD, the
only cure for that has been to run around the system and
apply brute force until things clear up.
I did a dashboard purge of the OSDs on ceph08 and that
removed them from the GUI (they'd already drained). I also
banged on things until I got them out of the OSD tree
display and then did a crush delete on host ceph08. And,
incidentally, the OSD tree works on simple host names, not
FQDNs like the rest of ceph!
So in theory, I'm ready to jack in new drives and add new
OSDs to ceph08. Except:
# ceph health detail
HEALTH_ERR 2 failed cephadm daemon(s); Module 'prometheus'
has failed: gaierror(-2, 'Name or service not known'); too
many PGs per OSD (648 > max 560)
[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
daemon osd.2 on ceph08.internal.mousetech.com is in
error state
daemon osd.4 on ceph08.internal.mousetech.com is in
error state
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed:
gaierror(-2, 'Name or service not known')
Module 'prometheus' has failed: gaierror(-2, 'Name or
service not known')
[WRN] TOO_MANY_PGS: too many PGs per OSD (648 > max 560)
Somewhere deep inside ceph, those deleted OSDs still
exist. Likely because ceph08 hasn't deleted the systemd
units that run them.
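A quick way to check for leftovers on ceph08 is presumably something
like
cephadm ls | grep osd
systemctl list-units 'ceph*' | grep osd
run on the host itself, and then comparing that against what the
cluster still thinks exists.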
I'm going to try removing/re-installing prometheus, since
it's now showing up in ceph health. I think last time I
had zombie OSDs I had to brute-force delete their
corresponding directories under /var/lib/ceph.
On 3/27/25 14:01, Eugen Block wrote:
ceph config-key rm mgr/cephadm/host.ceph06.internal.mousetech.com.devices.0
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io