OK. Found it.
The primary prometheus node had a bad /etc/hosts.
Most of my ceph nodes are on their own sub-domain, but a few have legacy
domain names, and the uncontacted node is one of them. Since I have
wildcard DNS resolution, Ceph was polling the wrong machine instead of
failing to resolve outright, and it wasn't obvious because the
/etc/hosts files on the other ceph nodes were set up properly.
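For anyone hitting the same thing, the fix amounts to pinning the
legacy name in /etc/hosts on the prometheus host so the wildcard record
never gets a chance to answer. The address and name below are made up
for illustration; substitute the real ones:

  # /etc/hosts on the prometheus host (hypothetical entry)
  10.0.1.23   legacy-node.mousetech.com   legacy-node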
The odd thing was, I thought I'd addressed that a while back, and as I
said, things HAD been working. But when I started digging into the
inner workings, I found that the entry had apparently reverted, even
though I've done no further maintenance on those hosts. I'll
double-check my master provisioner, and if the problem comes back, at
least I'll know where to look. Pity the dashboard error doesn't include
the failing IP address along with the URL.
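For what it's worth, running something like this on the prometheus host
shows which address a name actually resolves to, which is the piece the
alert leaves out (the hostname here is a placeholder):

  getent hosts legacy-node.mousetech.com   # what the resolver returns (/etc/hosts wins if present)
  dig +short legacy-node.mousetech.com     # what the wildcard DNS record answers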
Anyway, thanks all for the help!
Tim
On 4/21/25 09:46, Tim Holloway wrote:
Thanks, but all I'm getting is the following every 10 minutes from the
prometheus nodes:
Apr 21 09:29:32 dell02.mousetech.com podman[997331]: 2025-04-21
09:29:32.252358201 -0400 EDT m=+0.039016913 container exec
2149e16fa2ce8769bf3be9e6e25eec61b8e027b0e8699f1cb7d5f113fc4aac66
(image=quay.io/prometheus/node-exporter:v1.5.0,
name=ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f-node-exporter-dell02,
maintainer=The Prometheus Authors
<prometheus-develop...@googlegroups.com>)
Apr 21 09:29:32 dell02.mousetech.com podman[997331]: 2025-04-21
09:29:32.259543374 -0400 EDT m=+0.046202087 container exec_died
2149e16fa2ce8769bf3be9e6e25eec61b8e027b0e8699f1cb7d5f113fc4aac66
(image=quay.io/prometheus/node-exporter:v1.5.0,
name=ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f-node-exporter-dell02,
maintainer=The Prometheus Authors
<prometheus-develop...@googlegroups.com>)
Apr 21 09:29:32 dell02.mousetech.com podman[997518]: 2025-04-21
09:29:32.947995363 -0400 EDT m=+0.036761633 container exec
71c75380a0b63fa8a31fa296c733d8385309b44325813450fcd30670b249157c
(image=quay.io/prometheus/prometheus:v2.43.0,
name=ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f-prometheus-dell02,
maintainer=The Prometheus Authors
<prometheus-develop...@googlegroups.com>)
Apr 21 09:29:32 dell02.mousetech.com podman[997518]: 2025-04-21
09:29:32.979516297 -0400 EDT m=+0.068282565 container exec_died
71c75380a0b63fa8a31fa296c733d8385309b44325813450fcd30670b249157c
(image=quay.io/prometheus/prometheus:v2.43.0,
name=ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f-prometheus-dell02,
maintainer=The Prometheus Authors
<prometheus-develop...@googlegroups.com>)
It looks like a process is failing inside the prometheus container,
but the container itself remains in operation.
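If anyone needs more detail than those exec/exec_died events, the
container's own logs are usually more telling. A sketch, assuming a
standard cephadm deployment (the daemon name is a guess based on the
container name above):

  cephadm logs --name prometheus.dell02
  podman logs ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f-prometheus-dell02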
On another note, I was all green until 2 days ago when a node went
into overload. The node has recovered, but the prometheus alert is
back and has been for 2 days.
Tim
On 4/21/25 07:25, Ernesto Puerta wrote:
You could check Alertmanager container logs
<https://docs.ceph.com/en/quincy/cephadm/operations/#example-of-logging-to-journald>.
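If you log to journald (per that doc), the equivalent would look roughly
like this; the fsid comes from the container names earlier in the
thread, and the daemon id is an assumption:

  journalctl -u ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f@alertmanager.dell02.service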
Kind Regards,
Ernesto
On Wed, Apr 16, 2025 at 4:54 PM Tim Holloway <t...@mousetech.com> wrote:
I'm thinking it's more some sort of latency error.
I have 2 prometheus daemons running at the moment. The hosts files on
all my ceph servers contain both hostname and FQDN.
This morning the alert was gone. I don't know where I might find a log
of when it comes and goes, but all was clean, then it wasn't, and now
it's clean again, even though I haven't been playing with any
configurations or bouncing hosts or services. It's just appearing and
disappearing.
Tim
On 4/16/25 09:34, Ankush Behl wrote:
Just to add to what Ernesto mentioned: your prometheus container might
not be able to reach the ceph scrape job, since it could be using the
FQDN or hostname. Try updating /etc/hosts with the IP and hostname of
the ceph scrape job (you can find it in the prometheus UI -> Status ->
Targets); restarting prometheus after that might help resolve the
issue.
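In a cephadm-managed cluster the restart would normally go through the
orchestrator rather than podman directly; a sketch, assuming the
service/daemon names match what ceph orch ps reports:

  ceph orch restart prometheus                 # restart the whole service
  ceph orch daemon restart prometheus.dell02   # or just the one daemon (name is a guess)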
On Wed, Apr 16, 2025 at 2:10 PM Ernesto Puerta <epuer...@redhat.com>
wrote:
Don't shoot the messenger. Dashboard is just displaying the alert that
Prometheus/AlertManager is reporting. The alert definition is here
<https://github.com/ceph/ceph/blob/3993779cde9d10512f4a26f87487d11103ac1bd0/monitoring/ceph-mixin/prometheus_alerts.yml#L342-L351>.
As you may see, it's based on the status of the Prometheus "ceph" scrape
job. This alert is vital, because if the "ceph" job is not scraping
metrics from the "mgr/prometheus" module, no other Ceph alert condition
will be detected, therefore creating a false sense of confidence.
You may start by having a look at the Prometheus and/or Alertmanager
web UIs, or by checking their logs.
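As a concrete check in the Prometheus expression browser (or Status ->
Targets), the standard per-target "up" metric shows whether the "ceph"
job is actually scraping; this is generic Prometheus usage, not a quote
from the alert file:

  up{job="ceph"}   # 1 = scrape succeeding, 0 = target down (roughly what the alert keys on)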
Kind Regards,
Ernesto
On Tue, Apr 15, 2025 at 7:28 PM Tim Holloway <t...@mousetech.com>
wrote:
Although I've had this problem since at least Pacific, I'm still seeing
it on Reef.
After much pain and suffering (covered elsewhere), I got my Prometheus
services deployed as intended, Ceph health OK, green across the board.
However, over the weekend, the dreaded "CephMgrPrometheusModuleInactive"
alert has returned to the Dashboard.
"The mgr/prometheus module at dell02.mousetech.com:9283 is unreachable."
It's a blatant lie.
I still get "Ceph HEALTH_OK". All monitor status commands show
everything running. Checking ports on the host says it's listening.
More to the point, I can send my desktop browser to
http://dell02.mousetech.com:9283 and get a page that will allow me to
see the metrics. So everyone can see it but the Dashboard!
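(The same check from the prometheus host itself, where the scrape
actually originates, is a one-liner:

  curl -s http://dell02.mousetech.com:9283/metrics | head

and it also comes back fine.)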
I did have some issues when the other prometheus host couldn't resolve
the hostname, but I fixed that for all ceph hosts and it was green for
days. Now the error is back. Restarting Prometheus didn't help.
How is the Dashboard hallucinating this???
Tim
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io