Hi,
Since ceph orch ls wouldn't tell me /where/ the new prometheus was
deployed, I used the hosts tab in the dashboard to find it.
ceph orch ps --daemon-type prometheus
would show you where it tried to place the daemon.
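Adding a format flag also shows the exact host and ports it chose, e.g.:
ceph orch ps --daemon-type prometheus --format yaml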
So prometheus is now actually up and running? Just to confirm, because
you pasted the output of 'ceph orch ls' from when it wasn't (yet?).
ceph dashboard set-prometheus-api-port 909
Is this a copy & paste mistake or did you actually miss the 5 here?
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed: gaierror(-2,
'Name or service not known')
This seems to be a DNS issue; both relevant places in the code for
"gaierror" point to either
https://github.com/ceph/ceph/blob/be5dba538167f282c4ec74ea3cae958c8bd79830/src/pybind/mgr/cephadm/utils.py#L141
or
https://github.com/ceph/ceph/blob/be5dba538167f282c4ec74ea3cae958c8bd79830/src/python-common/ceph/deployment/utils.py#L58
where one tries a "_dns_lookup" and the other a "resolve_ip". What
does 'ceph mgr services' show?
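To reproduce the lookup outside of ceph, you could also try resolving
the host directly on the active mgr node, something like (assuming
ceph08 is the host in question):
getent hosts ceph08.internal.mousetech.com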
Quoting Tim Holloway <t...@mousetech.com>:
OK! Success of a sort.
I removed and re-installed each of the failed services in turn using
the "ceph orch rm" command followed by "ceph orch apply". They came
up with default settings (1 server), but they did come up.
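For the record, each cycle looked roughly like this (prometheus shown
as the example):
ceph orch rm prometheus
ceph orch apply prometheus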
Finally, I tried it with prometheus. This gave me:
prometheus    ?:9095    0/1    -    10s    count:1
However, in order for the dashboard to be happy, I had to supply
more info. Since ceph orch ls wouldn't tell me /where/ the new
prometheus was deployed, I used the hosts tab in the dashboard to
find it.
Following that, I had to set the following:
ceph config set mgr mgr/prometheus/server_addr 10.0.1.58
ceph config set mgr mgr/prometheus/server_port 9095
ceph dashboard set-prometheus-api-host 10.0.1.58 (ceph08)
ceph dashboard set-prometheus-api-port 909
Once all of the above were set, the dashboard stopped complaining
about being unable to access the prometheus API.
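For anyone repeating this, the values can be read back to
double-check, e.g.:
ceph config get mgr mgr/prometheus/server_addr
ceph config get mgr mgr/prometheus/server_port
ceph dashboard get-prometheus-api-host
ceph dashboard get-prometheus-api-port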
However, one last wart remains. Despite being up and running (and
confirmed listening on ceph09 port 9095), I do get this:
# ceph health detail
HEALTH_ERR Module 'prometheus' has failed: gaierror(-2, 'Name or
service not known'); too many PGs per OSD (648 > max 560)
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed: gaierror(-2,
'Name or service not known')
Module 'prometheus' has failed: gaierror(-2, 'Name or service not known')
[WRN] TOO_MANY_PGS: too many PGs per OSD (648 > max 560)
On 3/28/25 08:53, Tim Holloway wrote:
Actually, I did deploy a new mds node yesterday. But I followed
your instructions and successfully removed and re-installed
ceph-exporter (4 nodes). So that part works.
On 3/28/25 07:28, Eugen Block wrote:
Okay, next I would keep prometheus disabled to see if the mgr
works properly. So disable the module again, and also reset the
dashboard setting to an empty value:
ceph dashboard reset-prometheus-api-host
Then see if you get an mds daemon deployed. Or test it by removing
and redeploying ceph-exporter or crash or something, anything to
test if the mgr is able to remove and deploy other services.
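Something like this should do it (assuming ceph-exporter is managed
by the orchestrator):
ceph mgr module disable prometheus
ceph orch rm ceph-exporter
ceph orch apply ceph-exporter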
Quoting Tim Holloway <t...@mousetech.com>:
Thanks for the info on removing stubborn dead OSDs. The actual
syntax required was:
cephadm rm-daemon --name osd.2 --fsid <fsid> --force
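The fsid can be read with 'ceph fsid', and afterwards 'cephadm ls' on
the host confirms the daemon is really gone, e.g.:
ceph fsid
cephadm ls | grep osd.2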
On the "too many pgs", that's because I'm down 2 OSDs. I've got
new drives, but they were waiting to clear out the dead stuff. I
know it's risky, but I have backups.
Recall that the start of this thread was on a HEALTH_OK system
and prometheus was not activating. The OSD stuff was just a
distraction.
I did notice that the attempt to add a new mds did work after I
did a "ceph mgr fail", so it's only prometheus that's a permanent
problem.
Here's the latest health after clearing out the dead OSDs:
# ceph health detail
HEALTH_ERR Module 'prometheus' has failed: gaierror(-2, 'Name or
service not known'); too many PGs per OSD (648 > max 560)
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed:
gaierror(-2, 'Name or service not known')
Module 'prometheus' has failed: gaierror(-2, 'Name or service
not known')
[WRN] TOO_MANY_PGS: too many PGs per OSD (648 > max 560)
And yes, disabling prometheus will make the "name or service not
known" errors go away.
On 3/28/25 02:49, Eugen Block wrote:
Did you disable the prometheus module? I would expect the
warning to clear if you did.
Somewhere deep inside ceph, those deleted OSDs still exist.
Likely because ceph08 hasn't deleted the systemd units that run
them.
Or do you still see those OSDs in 'cephadm ls' output on ceph08?
If you do, and if those OSDs are really already drained/purged,
you can remove them with 'cephadm rm-daemon --name osd.2'. And I
would try to get the MGR into a working state first, before you
try to deploy prometheus again. So my recommendation is to get
into HEALTH_OK first. And btw, "TOO_MANY_PGS: too many PGs per
OSD (648 > max 560)" is serious, you can end up with inactive
PGs during recovery, so I'd also consider checking the pools and
their PGs.
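Something like this should show which pools are driving the PG count
and what the autoscaler recommends:
ceph osd pool ls detail
ceph osd pool autoscale-status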
Quoting Tim Holloway <t...@mousetech.com>:
Thanks for your patience.
host ceph06 isn't referenced in the config database. I think
I've finally purged it. I also reset the dashboard API host
address from ceph08 to dell02. But since prometheus isn't
running on dell02 either, there's no gain there.
I did clear some of that lint out via "ceph mgr fail".
So here's the latest. There are strange things happening at the
base OS level that keep host ceph08 from running its OSDs
anymore. At boot, device /dev/sdb suddenly changes to /dev/sdd
(????) and there seem to be I/O errors. It's really strange,
but I'm going to replace the physical drive and that will
hopefully cure that.
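In the meantime, persistent device names and the kernel log might pin
it down, e.g.:
ls -l /dev/disk/by-id/
dmesg | grep -i 'i/o error'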
The problem is, reef and earlier releases seem to have a lot of
trouble deleting OSDs that aren't running. As I've noted
before, they tend to get permanently stuck in the "deleting"
state. When I cannot restart the OSD, the only cure for that
has been to run around the system and apply brute force until
things clear up.
I did a dashboard purge of the OSDs on ceph08 and that removed
them from the GUI (they'd already drained). I also banged on
things until I got them out of the OSD tree display and then
did a crush delete on host ceph08. And, incidentally, the OSD
tree works on simple host names, not FQDNs like the rest of ceph!
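For reference, the crush delete was along the lines of (note the
short host name):
ceph osd crush rm ceph08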
So in theory, I'm ready to jack in new drives and add new OSDs
to ceph08. Except:
# ceph health detail
HEALTH_ERR 2 failed cephadm daemon(s); Module 'prometheus' has
failed: gaierror(-2, 'Name or service not known'); too many PGs
per OSD (648 > max 560)
[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
daemon osd.2 on ceph08.internal.mousetech.com is in error state
daemon osd.4 on ceph08.internal.mousetech.com is in error state
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed:
gaierror(-2, 'Name or service not known')
Module 'prometheus' has failed: gaierror(-2, 'Name or
service not known')
[WRN] TOO_MANY_PGS: too many PGs per OSD (648 > max 560)
Somewhere deep inside ceph, those deleted OSDs still exist.
Likely because ceph08 hasn't deleted the systemd units that run
them.
I'm going to try removing/re-installing prometheus, since it's
now showing up in ceph health. I think last time I had zombie
OSDs I had to brute-force delete their corresponding
directories under /var/lib/ceph.
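If it comes to that again, the daemon directory on the host would
presumably be something like this (assuming the standard cephadm
layout; verify the path before deleting anything):
rm -rf /var/lib/ceph/<fsid>/osd.2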
On 3/27/25 14:01, Eugen Block wrote:
ceph config-key rm mgr/cephadm/host.ceph06.internal.mousetech.com.devices.0
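To spot similar stale entries first, e.g.:
ceph config-key ls | grep ceph06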
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io