OK. Found it.

The primary prometheus node had a bad /etc/hosts.

Most of my ceph nodes are on their own subdomain, but a few have legacy domain names, and the unreachable node is one of them. Since I have wildcard resolution on DNS, Ceph was polling the wrong machine instead of failing to resolve outright, and it wasn't obvious because /etc/hosts on the other ceph nodes was set properly.

The odd thing is, I thought I'd addressed that a while back, and as I said, things HAD been working. But when I started digging into the inner workings, I found that the entry had apparently reverted, even though I've done no further maintenance on those hosts. I'll double-check my master provisioner, and if the problem comes back, at least I'll know to look more carefully. Pity the dashboard error doesn't include the failing IP address along with the URL.
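In case anyone else hits this, here's the kind of quick sanity check I'll be adding to my routine (a minimal Python sketch; the hostname and IP below are illustrative, substitute your own), since with wildcard DNS a stale entry hands back a wrong-but-valid answer instead of an error:

# check_resolution.py -- compare what this node actually resolves
# against the IP you expect. Run it on every ceph host.
import socket

EXPECTED = {
    "dell02.mousetech.com": "10.0.0.2",  # hypothetical IP; use your real one
}

for name, want in EXPECTED.items():
    got = socket.gethostbyname(name)  # consults /etc/hosts before DNS on a default nsswitch setup
    flag = "" if got == want else "  <-- MISMATCH"
    print(f"{name}: expected {want}, got {got}{flag}")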

Anyway, thanks all for the help!

   Tim

On 4/21/25 09:46, Tim Holloway wrote:
Thanks, but all I'm getting is the following every 10 minutes from the prometheus nodes:

Apr 21 09:29:32 dell02.mousetech.com podman[997331]: 2025-04-21 09:29:32.252358201 -0400 EDT m=+0.039016913 container exec 2149e16fa2ce8769bf3be9e6e25eec61b8e027b0e8699f1cb7d5f113fc4aac66 (image=quay.io/prometheus/node-exporter:v1.5.0, name=ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f-node-exporter-dell02, maintainer=The Prometheus Authors <prometheus-develop...@googlegroups.com>)
Apr 21 09:29:32 dell02.mousetech.com podman[997331]: 2025-04-21 09:29:32.259543374 -0400 EDT m=+0.046202087 container exec_died 2149e16fa2ce8769bf3be9e6e25eec61b8e027b0e8699f1cb7d5f113fc4aac66 (image=quay.io/prometheus/node-exporter:v1.5.0, name=ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f-node-exporter-dell02, maintainer=The Prometheus Authors <prometheus-develop...@googlegroups.com>)
Apr 21 09:29:32 dell02.mousetech.com podman[997518]: 2025-04-21 09:29:32.947995363 -0400 EDT m=+0.036761633 container exec 71c75380a0b63fa8a31fa296c733d8385309b44325813450fcd30670b249157c (image=quay.io/prometheus/prometheus:v2.43.0, name=ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f-prometheus-dell02, maintainer=The Prometheus Authors <prometheus-develop...@googlegroups.com>)
Apr 21 09:29:32 dell02.mousetech.com podman[997518]: 2025-04-21 09:29:32.979516297 -0400 EDT m=+0.068282565 container exec_died 71c75380a0b63fa8a31fa296c733d8385309b44325813450fcd30670b249157c (image=quay.io/prometheus/prometheus:v2.43.0, name=ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f-prometheus-dell02, maintainer=The Prometheus Authors <prometheus-develop...@googlegroups.com>)

It looks like a process is failing inside the prometheus container, but the container itself remains in operation.

On another note, I was all green until 2 days ago when a node went into overload. The node has recovered, but the prometheus alert is back and has been for 2 days.

   Tim


On 4/21/25 07:25, Ernesto Puerta wrote:
You could check the Alertmanager container logs
<https://docs.ceph.com/en/quincy/cephadm/operations/#example-of-logging-to-journald>.
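For example, with the journald logging described there, something along these lines (substitute your cluster fsid and the host actually running Alertmanager):

journalctl -u ceph-<fsid>@alertmanager.<hostname>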

Kind Regards,
Ernesto


On Wed, Apr 16, 2025 at 4:54 PM Tim Holloway <t...@mousetech.com> wrote:

I'm thinking it's more likely some sort of latency error.

I have 2 prometheus daemons running at the moment. The hosts files on
all my ceph servers contain both hostname and FQDN.

This morning the alert was gone. I don't know where I might find a log
of when it comes and goes, but all was clean, then it wasn't, and now
it's clean again, and I haven't been playing with any configurations
or bouncing hosts or services. It's just appearing and disappearing.
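(If Prometheus keeps enough history, its built-in ALERTS series ought to show when the alert fires and clears; presumably a query like this in the Prometheus web UI, using the alert name from the dashboard:

ALERTS{alertname="CephMgrPrometheusModuleInactive"}

I'll give that a try.)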

     Tim

On 4/16/25 09:34, Ankush Behl wrote:
Just to add to what Ernesto mentioned: your prometheus container
might not be able to reach the ceph scrape target, as it could be using
the FQDN or the hostname. Try updating /etc/hosts with the IP and hostname
of the ceph scrape target (you can find it in the Prometheus UI -> Status ->
Targets); restarting Prometheus after that might help resolve the issue.
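For example, an entry like this (the IP address here is illustrative; use the real one for your node):

10.0.0.2   dell02.mousetech.com dell02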

On Wed, Apr 16, 2025 at 2:10 PM Ernesto Puerta <epuer...@redhat.com>
wrote:
Don't shoot the messenger. Dashboard is just displaying the alert that
Prometheus/AlertManager is reporting. The alert definition is here
<https://github.com/ceph/ceph/blob/3993779cde9d10512f4a26f87487d11103ac1bd0/monitoring/ceph-mixin/prometheus_alerts.yml#L342-L351>.
As you may see, it's based on the status of the Prometheus "ceph" scrape
job. This alert is vital, because if the "ceph" job is not scraping metrics
from the "mgr/prometheus" module, no other Ceph alert condition will be
detected, therefore creating a false sense of confidence.
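In essence (paraphrasing the linked rule; see the file for the exact definition), it fires when the "ceph" scrape target is down:

up{job="ceph"} == 0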

You may start by having a look at the Prometheus and/or Alertmanager web
UIs, or by checking their logs.

Kind Regards,
Ernesto


On Tue, Apr 15, 2025 at 7:28 PM Tim Holloway <t...@mousetech.com>
wrote:
Although I've had this problem since at least Pacific, I'm still seeing
it on Reef.

After much pain and suffering (covered elsewhere), I got my Prometheus services deployed as intended, Ceph health OK, green across the board.

However, over the weekend, the dreaded
"CephMgrPrometheusModuleInactive" alert has returned to the Dashboard.
"The mgr/prometheus module at dell02.mousetech.com:9283 is
unreachable."

It's a blatant lie.

I still get "Ceph HEALTH_OK". All monitor status commands show
everything running. Checking ports on the host shows it's listening.

More to the point, I can send my desktop browser to
http://dell02.mousetech.com:9283 and get a page that will allow me to
see the metrics. So everyone can see it but the Dashboard!
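(One thing I haven't tried yet is the same fetch from the host where Prometheus itself runs, e.g.

curl -s http://dell02.mousetech.com:9283/metrics | head

which would rule out that node resolving the name differently than my desktop does.)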

I did have some issues when the other prometheus host couldn't resolve the hostname, but I fixed that for all ceph hosts and it was green for
days. Now the error is back. Restarting Prometheus didn't help.

How is the Dashboard hallucinating this???

     Tim
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
