It's rather dense. I get two very long lines, which come from
mgr/cephadm/host.dell02.mousetech.com and
mgr/cephadm/host.ceph08.mousetech.com
The ceph08 entry references the prometheus on ceph08 and the mgr and
ceph-exporter on dell02.
The dell02 entry references the container image for the prometheus node.
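If it helps, I can dump either entry in full and pretty-print it with
something along the lines of
ceph config-key get mgr/cephadm/host.dell02.mousetech.com | python3 -m json.tool
(assuming the stored value is JSON, which it appears to be) and grep
that for prometheus.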
On 3/29/25 05:13, Eugen Block wrote:
How about this:
ceph config-key dump | grep -v history
Can you spot any key regarding dell02 that doesn't belong there?
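If the dump is still too noisy, narrowing it with another grep should
work, e.g.:
ceph config-key dump | grep -v history | grep -i dell02
That should catch both the dell02 keys themselves and any other
entries that mention dell02.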
Zitat von Tim Holloway <t...@mousetech.com>:
Only the stuff that defines the rgw daemon on dell02.
On 3/28/25 19:23, Eugen Block wrote:
Do you find anything related to dell02 in config dump?
ceph config dump | grep -C2 dell02
Zitat von Tim Holloway <t...@mousetech.com>:
I'm guessing that the configuration issues come from the dashboard
wanting the prometheus API at 9095, versus prometheus itself on 9283.
Regardless, that didn't fix the message. As far as I can tell, the
"service not known" is coming from something trying to contact
prometheus on host dell02, and dell02 isn't running prometheus,
since the way I got it working at all was to do a generic "ceph
orch apply" without arguments. The problem is that the YAML config
is apparently still lurking around in the background even though it
never spawned the requested instances.
What I'll probably do is remove prometheus, try an "orch apply"
with the placement given on the command line (deploy to the two hosts
ceph02 and dell02), see what works or breaks, and if that succeeds,
try again with the YAML. That won't totally prove everything's fixed,
since it could be drawing on the hidden stuff that won't go away, but
at least it would make it superficially clean.
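Concretely, the command-line version would be something like
ceph orch rm prometheus
ceph orch apply prometheus --placement="2 ceph02 dell02"
(count and hosts per the plan above), and only if that comes up
cleanly, re-apply the YAML spec.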
On 3/28/25 18:52, Eugen Block wrote:
There's still some misconfiguration, it appears. It can be
confusing, but one thing is the mgr module "prometheus", which
provides additional cluster data and runs on default port 9283.
The other is the "prometheus server", which collects all the
provided cluster data, typically on port 9095.
So these two configs are wrong:
ceph config set mgr mgr/prometheus/server_addr 10.0.1.58
ceph config set mgr mgr/prometheus/server_port 9095
Instead, they should be back at the defaults:
ceph config get mgr mgr/prometheus/server_addr
0.0.0.0
ceph config get mgr mgr/prometheus/server_port
9283
I assume that's why the module is still failing. Can you give that
a try and report back?
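If it's easier, the quickest way back to the defaults is probably to
just drop the overrides and bounce the module, e.g.:
ceph config rm mgr mgr/prometheus/server_addr
ceph config rm mgr mgr/prometheus/server_port
ceph mgr module disable prometheus
ceph mgr module enable prometheus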
Zitat von Tim Holloway <t...@mousetech.com>:
OK. I didn't realize I'd pasted the wrong orch ls output. Yes,
it's "1/1" and has been for an hour. And yes, I did mis-type the
port.
The final complaint appears to arise from something looking for
prometheus at the failed deployment location dell02. The
dashboard says:
The mgr/prometheus module at dell02.mousetech.com:9095 is
unreachable.
This could mean that the module has been disabled or the mgr daemon
itself is down. Without the mgr/prometheus module metrics and alerts
will no longer ...
ceph mgr services shows:
# ceph mgr services
{
"dashboard":"https://10.0.1.52:8443/",
"prometheus":"http://10.0.1.58:9095/"
}
On 3/28/25 09:55, Eugen Block wrote:
Hi,
Since ceph orch ls wouldn't tell me /where/ the new prometheus
was deployed, I used the hosts tab in the dashboard to find it.
ceph orch ps --daemon-type prometheus
would show you where it tried to place the daemon.
So prometheus is now actually up and running? Just to confirm
because you pasted the output of 'ceph orch ls' when it wasn't
(yet?).
ceph dashboard set-prometheus-api-port 909
Is this a c&p mistake or did you actually miss the 5 here?
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed:
gaierror(-2, 'Name or service not known')
This seems to be a DNS issue, both relevant places in the code
for "gaierror" point to either
https://github.com/ceph/ceph/blob/be5dba538167f282c4ec74ea3cae958c8bd79830/src/pybind/mgr/cephadm/utils.py#L141
or
https://github.com/ceph/ceph/blob/be5dba538167f282c4ec74ea3cae958c8bd79830/src/python-common/ceph/deployment/utils.py#L58
where one tries a "_dns_lookup" and the other a "resolve_ip".
What does 'ceph mgr services' show?
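A quick way to reproduce what the mgr is trying would be to run the
lookup by hand on the active mgr host, e.g.:
getent hosts dell02.mousetech.com
If that returns nothing, the module would hit the same gaierror.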
Zitat von Tim Holloway <t...@mousetech.com>:
OK! Success of a sort.
I removed and re-installed each of the failed services in turn
using the "ceph orch rm" command followed by "ceph orch apply".
They came up with default settings (1 server), but they did
come up.
Finally, I tried it with prometheus. This gave me:
prometheus ?:9095 0/1 - 10s
count:1
However, in order for the dashboard to be happy, I had to
supply more info. Since ceph orch ls wouldn't tell me /where/
the new prometheus was deployed, I used the hosts tab in the
dashboard to find it.
Following that, I had to set the following:
ceph config set mgr mgr/prometheus/server_addr 10.0.1.58
ceph config set mgr mgr/prometheus/server_port 9095
ceph dashboard set-prometheus-api-host 10.0.1.58 (ceph08)
ceph dashboard set-prometheus-api-port 909
Once all of the above were set, the dashboard stopped
complaining about not being able to access the prometheus API.
However, one last wart remains. Despite prometheus being up and
running (and confirmed listening on ceph09 port 9095), I still get this:
# ceph health detail
HEALTH_ERR Module 'prometheus' has failed: gaierror(-2, 'Name
or service not known'); too many PGs per OSD (648 > max 560)
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed:
gaierror(-2, 'Name or service not known')
Module 'prometheus' has failed: gaierror(-2, 'Name or
service not known')
[WRN] TOO_MANY_PGS: too many PGs per OSD (648 > max 560)
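I suppose the next thing to check is whether the endpoint actually
answers from the mgr host, something like:
curl -s http://10.0.1.58:9095/metrics | head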
On 3/28/25 08:53, Tim Holloway wrote:
Actually, I did deploy a new mds node yesterday. But I
followed your instructions and successfully removed and
re-installed ceph-exporter (4 nodes). So that part works.
On 3/28/25 07:28, Eugen Block wrote:
Okay, next I would keep prometheus disabled to see if the mgr
works properly. So disable the module again, and also reset
the dashboard setting to an empty value:
ceph dashboard reset-prometheus-api-host
Then see if you get an mds daemon deployed. Or test it by
removing and redeploying ceph-exporter or crash or something,
anything to test if the mgr is able to remove and deploy
other services.
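For example (if I have the service name right), something like
ceph mgr module disable prometheus
ceph orch redeploy ceph-exporter
or a full 'ceph orch rm ceph-exporter' followed by 'ceph orch apply
ceph-exporter' would be enough to show whether the orchestrator is
doing its job.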
Zitat von Tim Holloway <t...@mousetech.com>:
Thanks for the info on removing stubborn dead OSDs. The
actual syntax required was:
cephadm rm-daemon --name osd.2 --fsid <fsid> --force
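(For anyone following along, the fsid itself comes from 'ceph fsid'.)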
On the "too many pgs", that's because I'm down 2 OSDs. I've
got new drives, but they were waiting to clear out the dead
stuff. I know it's risky, but I have backups.
Recall that the start of this thread was on a HEALTH_OK
system and prometheus was not activating. The OSD stuff was
just a distraction.
I did notice that the attempt to add a new mds did work
after I did a "ceph mgr fail", so it's only prometheus
that's a permanent problem.
Here's the latest health after clearing out the dead OSDs:
# ceph health detail
HEALTH_ERR Module 'prometheus' has failed: gaierror(-2,
'Name or service not known'); too many PGs per OSD (648 >
max 560)
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed:
gaierror(-2, 'Name or service not known')
Module 'prometheus' has failed: gaierror(-2, 'Name or
service not known')
[WRN] TOO_MANY_PGS: too many PGs per OSD (648 > max 560)
And yes, disabling prometheus will make the "name or service
not known" errors go away.
On 3/28/25 02:49, Eugen Block wrote:
Did you disable the prometheus module? I would expect the
warning to clear if you did.
Somewhere deep inside ceph, those deleted OSDs still
exist. Likely because ceph08 hasn't deleted the systemd
units that run them.
Or do you still see those OSDs in 'cephadm ls' output on
ceph08? If you do, and if those OSDs are really already
drained/purged, you can remove them with 'cephadm rm-daemon
--name osd.2'. And I would try to get the MGR into a
working state first, before you try to deploy prometheus
again. So my recommendation is to get into HEALTH_OK first.
And btw, "TOO_MANY_PGS: too many PGs per OSD (648 > max
560)" is serious: you can end up with inactive PGs during
recovery, so I'd also consider checking the pools and their
PGs.
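A starting point could be
ceph osd pool ls detail
ceph osd pool autoscale-status
to see which pools account for most of the PGs.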
Zitat von Tim Holloway <t...@mousetech.com>:
Thanks for your patience.
host ceph06 isn't referenced in the config database. I
think I've finally purged it. I also reset the dashboard
API host address from ceph08 to dell02. But since
prometheus isn't running on dell02 either, there's no gain
there.
I did clear some of that lint out via "ceph mgr fail".
So here's the latest. There are strange things happening
at the base OS level that keep host ceph08 from running
its OSDs anymore. At boot, device /dev/sdb suddenly
changes to /dev/sdd (????) and there seem to be I/O
errors. It's really strange, but I'm going to replace the
physical drive and that will hopefully cure that.
The problem is, reef and earlier releases seem to have a
lot of trouble in deleting OSDs that aren't running. As
I've noted before, they tend to get permanently stuck in
the "deleting" state. When I cannot restart the OSD, the
only cure for that has been to run around the system and
apply brute force until things clear up.
I did a dashboard purge of the OSDs on ceph08 and that
removed them from the GUI (they'd already drained). I also
banged on things until I got them out of the OSD tree
display and then did a crush delete on host ceph08. And,
incidentally, the OSD tree works on simple host names, not
FQDNs like the rest of ceph!
So in theory, I'm ready to jack in new drives and add new
OSDs to ceph08. Except:
# ceph health detail
HEALTH_ERR 2 failed cephadm daemon(s); Module 'prometheus'
has failed: gaierror(-2, 'Name or service not known'); too
many PGs per OSD (648 > max 560)
[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
daemon osd.2 on ceph08.internal.mousetech.com is in
error state
daemon osd.4 on ceph08.internal.mousetech.com is in
error state
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed:
gaierror(-2, 'Name or service not known')
Module 'prometheus' has failed: gaierror(-2, 'Name or
service not known')
[WRN] TOO_MANY_PGS: too many PGs per OSD (648 > max 560)
Somewhere deep inside ceph, those deleted OSDs still
exist. Likely because ceph08 hasn't deleted the systemd
units that run them.
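A quick way to check for leftovers on ceph08 is presumably something
like
cephadm ls | grep osd
systemctl list-units 'ceph*' | grep osd
run on the host itself, and then comparing that against what the
cluster still thinks exists.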
I'm going to try removing/re-installing prometheus, since
it's now showing up in ceph health. I think last time I
had zombie OSDs I had to brute-force delete their
corresponding directories under /var/lib/ceph.
On 3/27/25 14:01, Eugen Block wrote:
ceph config-key rm mgr/cephadm/host.ceph06.internal.mousetech.com.devices.0
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io