OK. Found it.

The primary prometheus node had a bad /etc/hosts.

Most of my ceph nodes are on their own subdomain, but a few have legacy domain names, and the unreachable node is one of them. Since I have wildcard resolution on DNS, Ceph was polling the wrong machine instead of failing to resolve outright, and it wasn't obvious because /etc/hosts on the other ceph nodes was set properly.

The odd thing is, I thought I'd addressed that a while back, and as I said, things HAD been working. But when I started digging into the inner workings, I found that the entry had apparently reverted, even though I've done no further maintenance on those hosts. I'll double-check my master provisioner, and if the problem comes back, at least I'll know to look more carefully. Pity the dashboard error doesn't include the failing IP address along with the URL.
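In case anyone else hits this, here's the kind of quick sanity check I'll be adding to my routine (a minimal Python sketch; the hostname and IP below are illustrative, substitute your own), since with wildcard DNS a stale entry hands back a wrong-but-valid answer instead of an error:

# check_resolution.py -- compare what this node actually resolves
# against the IP you expect. Run it on every ceph host.
import socket

EXPECTED = {
    "dell02.mousetech.com": "10.0.0.2",  # hypothetical IP; use your real one
}

for name, want in EXPECTED.items():
    got = socket.gethostbyname(name)  # consults /etc/hosts before DNS on a default nsswitch setup
    flag = "" if got == want else "  <-- MISMATCH"
    print(f"{name}: expected {want}, got {got}{flag}")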

Anyway, thanks all for the help!

   Tim

On 4/21/25 09:46, Tim Holloway wrote:
Thanks, but all I'm getting is the following every 10 minutes from the prometheus nodes:

Apr 21 09:29:32 dell02.mousetech.com podman[997331]: 2025-04-21 09:29:32.252358201 -0400 EDT m=+0.039016913 container exec 2149e16fa2ce8769bf3be9e6e25eec61b8e027b0e8699f1cb7d5f113fc4aac66 (image=quay.io/prometheus/node-exporter:v1.5.0, name=ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f-node-exporter-dell02, maintainer=The Prometheus Authors <prometheus-develop...@googlegroups.com>)
Apr 21 09:29:32 dell02.mousetech.com podman[997331]: 2025-04-21 09:29:32.259543374 -0400 EDT m=+0.046202087 container exec_died 2149e16fa2ce8769bf3be9e6e25eec61b8e027b0e8699f1cb7d5f113fc4aac66 (image=quay.io/prometheus/node-exporter:v1.5.0, name=ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f-node-exporter-dell02, maintainer=The Prometheus Authors <prometheus-develop...@googlegroups.com>)
Apr 21 09:29:32 dell02.mousetech.com podman[997518]: 2025-04-21 09:29:32.947995363 -0400 EDT m=+0.036761633 container exec 71c75380a0b63fa8a31fa296c733d8385309b44325813450fcd30670b249157c (image=quay.io/prometheus/prometheus:v2.43.0, name=ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f-prometheus-dell02, maintainer=The Prometheus Authors <prometheus-develop...@googlegroups.com>)
Apr 21 09:29:32 dell02.mousetech.com podman[997518]: 2025-04-21 09:29:32.979516297 -0400 EDT m=+0.068282565 container exec_died 71c75380a0b63fa8a31fa296c733d8385309b44325813450fcd30670b249157c (image=quay.io/prometheus/prometheus:v2.43.0, name=ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f-prometheus-dell02, maintainer=The Prometheus Authors <prometheus-develop...@googlegroups.com>)

It looks like a process is failing inside the prometheus container, but the container itself remains in operation.

On another note, I was all green until 2 days ago when a node went into overload. The node has recovered, but the prometheus alert is back and has been for 2 days.

   Tim


On 4/21/25 07:25, Ernesto Puerta wrote:
You could check the Alertmanager container logs
<https://docs.ceph.com/en/quincy/cephadm/operations/#example-of-logging-to-journald>.
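For example, with the journald logging described there, something along these lines (substitute your cluster fsid and the host actually running Alertmanager):

journalctl -u ceph-<fsid>@alertmanager.<hostname>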

Kind Regards,
Ernesto


On Wed, Apr 16, 2025 at 4:54 PM Tim Holloway <t...@mousetech.com> wrote:

I'm thinking it's more likely some sort of latency error.

I have 2 prometheus daemons running at the moment. The hosts files on
all my ceph servers contain both hostname and FQDN.

This morning the alert was gone. I don't know where I might find a log
of when it comes and goes, but all was clean, then it wasn't, and now
it's clean again, and I haven't been playing with any configurations
or bouncing hosts or services. It's just appearing and disappearing.
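(If Prometheus keeps enough history, its built-in ALERTS series ought to show when the alert fires and clears; presumably a query like this in the Prometheus web UI, using the alert name from the dashboard:

ALERTS{alertname="CephMgrPrometheusModuleInactive"}

I'll give that a try.)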

     Tim

On 4/16/25 09:34, Ankush Behl wrote:
Just to add to what Ernesto mentioned: your prometheus container
might not be able to reach the ceph scrape target, as it could be using
the FQDN or the hostname. Try updating /etc/hosts with the IP and hostname
of the ceph scrape target (you can find it in the Prometheus UI -> Status ->
Targets); restarting Prometheus after that might help resolve the issue.
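For example, an entry like this (the IP address here is illustrative; use the real one for your node):

10.0.0.2   dell02.mousetech.com dell02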

On Wed, Apr 16, 2025 at 2:10 PM Ernesto Puerta <epuer...@redhat.com>
wrote:
Don't shoot the messenger. Dashboard is just displaying the alert that
Prometheus/AlertManager is reporting. The alert definition is here
<https://github.com/ceph/ceph/blob/3993779cde9d10512f4a26f87487d11103ac1bd0/monitoring/ceph-mixin/prometheus_alerts.yml#L342-L351>.
As you may see, it's based on the status of the Prometheus "ceph" scrape
job. This alert is vital, because if the "ceph" job is not scraping metrics
from the "mgr/prometheus" module, no other Ceph alert condition will be
detected, therefore creating a false sense of confidence.
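In essence (paraphrasing the linked rule; see the file for the exact definition), it fires when the "ceph" scrape target is down:

up{job="ceph"} == 0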

You may start by having a look at the Prometheus and/or Alertmanager web
UIs, or by checking their logs.

Kind Regards,
Ernesto


On Tue, Apr 15, 2025 at 7:28 PM Tim Holloway <t...@mousetech.com>
wrote:
Although I've had this problem since at least Pacific, I'm still seeing
it on Reef.

After much pain and suffering (covered elsewhere), I got my Prometheus services deployed as intended, Ceph health OK, green across the board.

However, over the weekend, the dreaded
"CephMgrPrometheusModuleInactive" alert has returned to the Dashboard.
"The mgr/prometheus module at dell02.mousetech.com:9283 is
unreachable."

It's a blatant lie.

I still get "Ceph HEALTH_OK". All monitor status commands show
everything running. Checking ports on the host shows it's listening.

More to the point, I can send my desktop browser to
http://dell02.mousetech.com:9283 and get a page that will allow me to
see the metrics. So everyone can see it but the Dashboard!
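(One thing I haven't tried yet is the same fetch from the host where Prometheus itself runs, e.g.

curl -s http://dell02.mousetech.com:9283/metrics | head

which would rule out that node resolving the name differently than my desktop does.)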

I did have some issues when the other prometheus host couldn't resolve the hostname, but I fixed that for all ceph hosts and it was green for
days. Now the error is back. Restarting Prometheus didn't help.

How is the Dashboard hallucinating this???

     Tim
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
