You're welcome! Thanks for sharing the RCA of the underlying issue.

Kind Regards,
Ernesto
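For anyone who lands on this thread later: below is a minimal sketch (assuming Python 3; the host name and port are the ones quoted in the alert further down, 9283 being the default mgr/prometheus port) of confirming that the exporter endpoint is reachable from the host where Prometheus actually runs. That distinction matters here, because it is that host's /etc/hosts, not your desktop's, that decides which machine the name resolves to.

#!/usr/bin/env python3
"""Check that the mgr/prometheus endpoint is reachable from *this* host.

Run it on the Prometheus host itself: the local /etc/hosts (not your
desktop's) decides which machine the name resolves to there.
"""
import socket
import urllib.request

# Host name and port taken from the alert in this thread; adjust for your cluster.
HOST = "dell02.mousetech.com"
PORT = 9283  # default mgr/prometheus port

addr = socket.gethostbyname(HOST)  # goes through the system resolver, /etc/hosts first
print(f"{HOST} resolves to {addr} on this host")

url = f"http://{HOST}:{PORT}/metrics"
try:
    with urllib.request.urlopen(url, timeout=5) as resp:
        body = resp.read(200).decode("utf-8", "replace")
        print(f"GET {url} -> HTTP {resp.status}, first bytes:\n{body}")
except OSError as exc:
    print(f"GET {url} failed: {exc}")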
On Mon, Apr 21, 2025 at 6:30 PM Tim Holloway <t...@mousetech.com> wrote:
> OK. Found it.
>
> The primary prometheus node had a bad /etc/hosts.
>
> Most of my ceph nodes are on their own sub-domain, but a few have legacy
> domain names and the uncontacted node is one of them. Since I have
> wildcard resolution on DNS, Ceph was polling the wrong machine instead
> of failing to resolve outright and it wasn't obvious because the
> /etc/hosts on the other ceph nodes was set properly.
>
> The odd thing was, I thought I'd addressed that a while back, and as I
> said things HAD been working. But when I started tracking into the inner
> workings, I found that the entry had apparently reverted despite the
> fact that I've done no further maintenance on them. I'll double-check my
> master provisioner and if the problem comes back at least I'll know to
> look more carefully. Pity the dashboard error doesn't include the
> failing IP address along with the URL.
>
> Anyway, thanks all for the help!
>
> Tim
>
> On 4/21/25 09:46, Tim Holloway wrote:
> > Thanks, but all I'm getting is the following every 10 minutes from the
> > prometheus nodes:
> >
> > Apr 21 09:29:32 dell02.mousetech.com podman[997331]: 2025-04-21
> > 09:29:32.252358201 -0400 EDT m=+0.039016913 container exec
> > 2149e16fa2ce8769bf3be9e6e25eec61b8e027b0e8699f1cb7d5f113fc4aac66
> > (image=quay.io/prometheus/node-exporter:v1.5.0,
> > name=ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f-node-exporter-dell02,
> > maintainer=The Prometheus Authors
> > <prometheus-develop...@googlegroups.com>)
> > Apr 21 09:29:32 dell02.mousetech.com podman[997331]: 2025-04-21
> > 09:29:32.259543374 -0400 EDT m=+0.046202087 container exec_died
> > 2149e16fa2ce8769bf3be9e6e25eec61b8e027b0e8699f1cb7d5f113fc4aac66
> > (image=quay.io/prometheus/node-exporter:v1.5.0,
> > name=ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f-node-exporter-dell02,
> > maintainer=The Prometheus Authors
> > <prometheus-develop...@googlegroups.com>)
> > Apr 21 09:29:32 dell02.mousetech.com podman[997518]: 2025-04-21
> > 09:29:32.947995363 -0400 EDT m=+0.036761633 container exec
> > 71c75380a0b63fa8a31fa296c733d8385309b44325813450fcd30670b249157c
> > (image=quay.io/prometheus/prometheus:v2.43.0,
> > name=ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f-prometheus-dell02,
> > maintainer=The Prometheus Authors
> > <prometheus-develop...@googlegroups.com>)
> > Apr 21 09:29:32 dell02.mousetech.com podman[997518]: 2025-04-21
> > 09:29:32.979516297 -0400 EDT m=+0.068282565 container exec_died
> > 71c75380a0b63fa8a31fa296c733d8385309b44325813450fcd30670b249157c
> > (image=quay.io/prometheus/prometheus:v2.43.0,
> > name=ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f-prometheus-dell02,
> > maintainer=The Prometheus Authors
> > <prometheus-develop...@googlegroups.com>)
> >
> > It looks like a process is failing inside the prometheus container,
> > but the container itself remains in operation.
> >
> > On another note, I was all green until 2 days ago when a node went
> > into overload. The node has recovered, but the prometheus alert is
> > back and has been for 2 days.
> >
> > Tim
> >
> > On 4/21/25 07:25, Ernesto Puerta wrote:
> >> You could check Alertmanager container logs
> >> <https://docs.ceph.com/en/quincy/cephadm/operations/#example-of-logging-to-journald>.
> >>
> >> Kind Regards,
> >> Ernesto
> >>
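For reference, a minimal sketch along the same lines (assuming Python 3 on the Prometheus host; the host names and addresses below are placeholders, not taken from Tim's cluster) that surfaces the kind of /etc/hosts vs. wildcard-DNS mismatch described in the RCA above: it resolves each Ceph host name through the system resolver, which consults /etc/hosts before DNS under the usual nsswitch ordering, and flags names that do not come back with the address you expect.

#!/usr/bin/env python3
"""Spot /etc/hosts vs. wildcard-DNS mismatches for a list of Ceph hosts.

Resolution below goes through the system resolver (getaddrinfo), so it
sees exactly what a daemon on this host would see: /etc/hosts first,
then DNS, per the usual nsswitch.conf ordering.
"""
import socket

# Placeholder inventory: map each Ceph host name to the address you
# *expect* it to have (e.g. from your provisioner or inventory file).
EXPECTED = {
    "ceph01.example.com": "10.0.0.11",
    "ceph02.example.com": "10.0.0.12",
}

for name, expected in EXPECTED.items():
    try:
        infos = socket.getaddrinfo(name, None, family=socket.AF_INET)
        resolved = sorted({info[4][0] for info in infos})
    except socket.gaierror as exc:
        print(f"{name}: does not resolve ({exc})")
        continue
    status = "OK" if expected in resolved else "MISMATCH (wildcard DNS?)"
    print(f"{name}: expected {expected}, got {resolved} -> {status}")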
> >> On Wed, Apr 16, 2025 at 4:54 PM Tim Holloway <t...@mousetech.com> wrote:
> >>> I'm thinking more some sort of latency error.
> >>>
> >>> I have 2 prometheus daemons running at the moment. The hosts files on
> >>> all my ceph servers contain both hostname and FQDN.
> >>>
> >>> This morning the alert was gone. I don't know where I might find a log
> >>> of when it comes and goes, but all was clean, then it wasn't, now it's
> >>> clean again, and I haven't been playing with any sort of configurations
> >>> or bouncing hosts or services. It's just appearing and disappearing.
> >>>
> >>> Tim
> >>>
> >>> On 4/16/25 09:34, Ankush Behl wrote:
> >>>> Just to add to what Ernesto mentioned: your prometheus container
> >>>> might not be able to reach out to the ceph scrape job, as it could be
> >>>> using an FQDN or hostname. Try updating /etc/hosts with the IP and
> >>>> hostname of the ceph scrape job (you can find it in the Prometheus
> >>>> UI -> Status -> Targets); restarting Prometheus after that might help
> >>>> resolve the issue.
> >>>>
> >>>> On Wed, Apr 16, 2025 at 2:10 PM Ernesto Puerta <epuer...@redhat.com> wrote:
> >>>>> Don't shoot the messenger. The Dashboard is just displaying the alert
> >>>>> that Prometheus/AlertManager is reporting. The alert definition is here:
> >>>>> <https://github.com/ceph/ceph/blob/3993779cde9d10512f4a26f87487d11103ac1bd0/monitoring/ceph-mixin/prometheus_alerts.yml#L342-L351>
> >>>>>
> >>>>> As you can see, it's based on the status of the Prometheus "ceph" scrape
> >>>>> job. This alert is vital, because if the "ceph" job is not scraping metrics
> >>>>> from the "mgr/prometheus" module, no other Ceph alert condition will be
> >>>>> detected, therefore creating a false sense of confidence.
> >>>>>
> >>>>> You may start by having a look at the Prometheus and/or Alertmanager web
> >>>>> UIs, or by checking their logs.
> >>>>>
> >>>>> Kind Regards,
> >>>>> Ernesto
> >>>>>
> >>>>> On Tue, Apr 15, 2025 at 7:28 PM Tim Holloway <t...@mousetech.com> wrote:
> >>>>>> Although I've had this problem since at least Pacific, I'm still seeing
> >>>>>> it on Reef.
> >>>>>>
> >>>>>> After much pain and suffering (covered elsewhere), I got my Prometheus
> >>>>>> services deployed as intended, Ceph health OK, green across the board.
> >>>>>>
> >>>>>> However, over the weekend, the dreaded
> >>>>>> "CephMgrPrometheusModuleInactive" alert has returned to the Dashboard:
> >>>>>> "The mgr/prometheus module at dell02.mousetech.com:9283 is unreachable."
> >>>>>>
> >>>>>> It's a blatant lie.
> >>>>>>
> >>>>>> I still get "Ceph HEALTH_OK". All monitor status commands show
> >>>>>> everything running. Checking ports on the host says it's listening.
> >>>>>>
> >>>>>> More to the point, I can send my desktop browser to
> >>>>>> http://dell02.mousetech.com:9283 and get a page that will allow me to
> >>>>>> see the metrics. So everyone can see it but the Dashboard!
> >>>>>>
> >>>>>> I did have some issues when the other prometheus host couldn't resolve
> >>>>>> the hostname, but I fixed that for all ceph hosts and it was green for
> >>>>>> days. Now the error is back. Restarting Prometheus didn't help.
> >>>>>>
> >>>>>> How is the Dashboard hallucinating this???
> >>>>>>
> >>>>>> Tim
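Since the alert is driven by the health of the Prometheus "ceph" scrape job, the same Status -> Targets information Ankush points at can also be pulled over the standard Prometheus HTTP API (/api/v1/targets). A minimal sketch, assuming Python 3 and that the Prometheus web port is reachable (cephadm deployments typically bind Prometheus to 9095, while a stock install uses 9090; the host name below is the one from this thread and the port is an assumption to adjust):

#!/usr/bin/env python3
"""Print health and last scrape error for Prometheus's "ceph" targets.

Uses the standard Prometheus HTTP API (/api/v1/targets), i.e. the same
data shown in the web UI under Status -> Targets.
"""
import json
import urllib.request

# Placeholder: adjust host/port. cephadm-deployed Prometheus usually
# listens on 9095, a stock Prometheus on 9090.
PROM_URL = "http://dell02.mousetech.com:9095"

with urllib.request.urlopen(f"{PROM_URL}/api/v1/targets", timeout=5) as resp:
    targets = json.load(resp)["data"]["activeTargets"]

for t in targets:
    job = t["labels"].get("job", t.get("scrapePool", "?"))
    if job != "ceph":
        continue
    print(f"{t['scrapeUrl']}: health={t['health']} "
          f"lastError={t['lastError'] or 'none'} lastScrape={t['lastScrape']}")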
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io