[ceph-users] Re: Dashboard lies to me

Tim Holloway Mon, 21 Apr 2025 06:47:40 -0700

Thanks, but all I'm getting is the following every 10 minutes from the prometheus nodes:

Apr 21 09:29:32 dell02.mousetech.com podman[997331]: 2025-04-21 09:29:32.252358201 -0400 EDT m=+0.039016913 container exec 2149e16fa2ce8769bf3be9e6e25eec61b8e027b0e8699f1cb7d5f113fc4aac66 (image=quay.io/prometheus/node-exporter:v1.5.0,name=ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f-node-exporter-dell02,maintainer=The Prometheus Authors <prometheus-develop...@googlegroups.com>)
Apr 21 09:29:32 dell02.mousetech.com podman[997331]: 2025-04-21 09:29:32.259543374 -0400 EDT m=+0.046202087 container exec_died 2149e16fa2ce8769bf3be9e6e25eec61b8e027b0e8699f1cb7d5f113fc4aac66 (image=quay.io/prometheus/node-exporter:v1.5.0,name=ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f-node-exporter-dell02,maintainer=The Prometheus Authors <prometheus-develop...@googlegroups.com>)
Apr 21 09:29:32 dell02.mousetech.com podman[997518]: 2025-04-21 09:29:32.947995363 -0400 EDT m=+0.036761633 container exec 71c75380a0b63fa8a31fa296c733d8385309b44325813450fcd30670b249157c (image=quay.io/prometheus/prometheus:v2.43.0,name=ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f-prometheus-dell02,maintainer=The Prometheus Authors <prometheus-develop...@googlegroups.com>)
Apr 21 09:29:32 dell02.mousetech.com podman[997518]: 2025-04-21 09:29:32.979516297 -0400 EDT m=+0.068282565 container exec_died 71c75380a0b63fa8a31fa296c733d8385309b44325813450fcd30670b249157c (image=quay.io/prometheus/prometheus:v2.43.0,name=ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f-prometheus-dell02,maintainer=The Prometheus Authors <prometheus-develop...@googlegroups.com>)

It looks like a process is failing inside the prometheus container, but the container itself remains in operation.

On another note, I was all green until 2 days ago when a node went into overload. The node has recovered, but the prometheus alert is back and has been for 2 days.


   Tim


On 4/21/25 07:25, Ernesto Puerta wrote:

You could check the Alertmanager container logs
<https://docs.ceph.com/en/quincy/cephadm/operations/#example-of-logging-to-journald>.

Kind Regards,
Ernesto


On Wed, Apr 16, 2025 at 4:54 PM Tim Holloway <t...@mousetech.com> wrote:

I'm thinking more some sort of latency error.

I have 2 prometheus daemons running at the moment. The hosts files on
all my ceph servers contain both hostname and FQDN.

This morning the alert was gone. I don't know where I might find a log
of when it comes and goes, but all was clean, then it wasn't, now it's
clean again and I haven't been playing with any sort of configurations
or bouncing hosts or services. It's just appearing and disappearing.
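
One place to look for a record of when it comes and goes is Prometheus itself, since it stores the `up` series for every scrape job. The sketch below is only a rough illustration: it assumes the job is named "ceph" (as in the alert text), that Prometheus answers on a base URL such as http://dell02.mousetech.com:9095 (the port is a guess, check your deployment), and the helper names `fetch_range` and `find_transitions` are made up for this example.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone


def find_transitions(samples):
    """Return (time, old_value, new_value) for every change in a series.

    `samples` is a list of [unix_time, "value"] pairs, the shape that a
    Prometheus range query returns under data.result[i].values.
    """
    transitions = []
    previous = None
    for ts, raw in samples:
        value = float(raw)
        if previous is not None and value != previous:
            when = datetime.fromtimestamp(float(ts), tz=timezone.utc)
            transitions.append((when, previous, value))
        previous = value
    return transitions


def fetch_range(base_url, query, start, end, step=60, timeout=10.0):
    """Run a range query against the Prometheus HTTP API.

    base_url is an assumption, e.g. "http://dell02.mousetech.com:9095".
    start and end are Unix timestamps; step is in seconds.
    """
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "step": step}
    )
    url = f"{base_url}/api/v1/query_range?{params}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["data"]["result"]
```

Calling `fetch_range(base, 'up{job="ceph"}', start, end)` and passing each series' `values` to `find_transitions` would list the moments the scrape flipped between 1 (up) and 0 (down). Samples that are missing altogether do not show up as a transition, so gaps need a separate look.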

     Tim

On 4/16/25 09:34, Ankush Behl wrote:

Just to add to what Ernesto mentioned: your Prometheus container
might not be able to reach the ceph scrape job, as it could be using
either the FQDN or the hostname. Updating /etc/hosts with the IP and
hostname of the ceph scrape job (you can find it in the Prometheus UI ->
Status -> Targets) and then restarting Prometheus might help resolve
the issue.
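
The same information shown under Status -> Targets is also available from the Prometheus HTTP API at /api/v1/targets, which makes it easy to see the exact scrape URL and the last scrape error. A minimal sketch, assuming the job is named "ceph" and that the base URL (host and port) is whatever your Prometheus listens on; `ceph_targets` and `fetch_targets` are hypothetical helper names:

```python
import json
import urllib.request


def ceph_targets(targets_response, job="ceph"):
    """Pick the targets of one scrape job out of an /api/v1/targets reply."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        {
            "scrape_url": t.get("scrapeUrl"),
            "health": t.get("health"),
            "last_error": t.get("lastError", ""),
        }
        for t in active
        if t.get("labels", {}).get("job") == job
    ]


def fetch_targets(base_url, timeout=5.0):
    """Fetch the target list; base_url is an assumption for your setup."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)
```

The `scrape_url` field shows whether Prometheus is using the FQDN or the short hostname, and `last_error` usually says whether the failure is name resolution, a refused connection, or a timeout.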

On Wed, Apr 16, 2025 at 2:10 PM Ernesto Puerta <epuer...@redhat.com> wrote:

Don't shoot the messenger. Dashboard is just displaying the alert that
Prometheus/AlertManager is reporting. The alert definition is here
<https://github.com/ceph/ceph/blob/3993779cde9d10512f4a26f87487d11103ac1bd0/monitoring/ceph-mixin/prometheus_alerts.yml#L342-L351>.

As you may see, it's based on the status of the Prometheus "ceph" scrape
job. This alert is vital, because if the "ceph" job is not scraping metrics
from the "mgr/prometheus" module, no other Ceph alert condition will be
detected, therefore creating a false sense of confidence.

You may start by having a look at the Prometheus and/or Alertmanager web
UIs, or by checking their logs.

Kind Regards,
Ernesto


On Tue, Apr 15, 2025 at 7:28 PM Tim Holloway <t...@mousetech.com> wrote:

Although I've had this problem since at least Pacific, I'm still seeing
it on Reef.

After much pain and suffering (covered elsewhere), I got my Prometheus
services deployed as intended, Ceph health OK, green across the board.

However, over the weekend, the dreaded
"CephMgrPrometheusModuleInactive" alert has returned to the Dashboard.
"The mgr/prometheus module at dell02.mousetech.com:9283 is
unreachable."

It's a blatant lie.

I still get "Ceph HEALTH_OK". All monitor status commands show
everything running. Checking ports on the host says it's listening.

More to the point, I can send my desktop browser to
http://dell02.mousetech.com:9283 and get a page that will allow me to
see the metrics. So everyone can see it but the Dashboard!
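
A browser on the desktop only proves the port is reachable from the desktop. To compare, the same check can be run from each host where a prometheus daemon lives. This is just a sketch using a plain TCP connect; `can_connect` is a made-up helper, and the host and port in the usage note are the ones quoted in the alert.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Try a TCP connection and return (ok, error_text).

    Name-resolution failures, refused connections and timeouts are all
    subclasses of OSError, so they end up in error_text.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        return False, str(exc)
```

Running `can_connect("dell02.mousetech.com", 9283)` on each prometheus host, once with the FQDN and once with the short hostname, would show whether any of them fails to resolve or reach the mgr/prometheus port. It still runs outside the container, so it may not catch a resolution problem that exists only inside it.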

I did have some issues when the other prometheus host couldn't resolve
the hostname, but I fixed that for all ceph hosts and it was green for
days. Now the error is back. Restarting Prometheus didn't help.

How is the Dashboard hallucinating this???

     Tim
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
