You're welcome! Thanks for sharing the RCA of the underlying issue.

Kind Regards,
Ernesto


On Mon, Apr 21, 2025 at 6:30 PM Tim Holloway <t...@mousetech.com> wrote:

> OK. Found it.
>
> The primary prometheus node had a bad /etc/hosts.
>
> Most of my ceph nodes are on their own sub-domain, but a few have legacy
> domain names, and the uncontacted node is one of them. Since I have
> wildcard resolution in DNS, Ceph was polling the wrong machine instead
> of failing to resolve outright, and it wasn't obvious because the
> /etc/hosts files on the other ceph nodes were set properly.
>
> The odd thing is, I thought I'd addressed that a while back, and as I
> said, things HAD been working. But when I started digging into the inner
> workings, I found that the entry had apparently reverted, even though I've
> done no further maintenance on these hosts. I'll double-check my master
> provisioner, and if the problem comes back at least I'll know to look
> more carefully. Pity the dashboard error doesn't include the failing IP
> address along with the URL.
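>
> A quick way to catch that kind of silent mis-resolution is to compare what
> each Ceph host name actually resolves to against what you expect. Here's a
> minimal Python sketch; the hostnames and addresses below are made up, so
> substitute your own:
>
>     import socket
>
>     # Hypothetical name -> expected address map; fill in your own hosts.
>     expected = {
>         "dell02.mousetech.com": "10.0.1.2",
>         "ceph01.ceph.mousetech.com": "10.0.1.11",
>     }
>
>     for name, want in expected.items():
>         try:
>             got = socket.gethostbyname(name)  # uses the system resolver, so /etc/hosts applies
>         except socket.gaierror as exc:
>             print(f"{name}: does not resolve ({exc})")
>             continue
>         status = "OK" if got == want else f"MISMATCH (expected {want})"
>         print(f"{name} -> {got}: {status}")
>
> Run it on each node; with wildcard DNS in play, a stale /etc/hosts entry
> shows up as a MISMATCH rather than a hard resolution failure.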
>
> Anyway, thanks all for the help!
>
>     Tim
>
> On 4/21/25 09:46, Tim Holloway wrote:
> > Thanks, but all I'm getting is the following every 10 minutes from the
> > prometheus nodes:
> >
> > Apr 21 09:29:32 dell02.mousetech.com podman[997331]: 2025-04-21
> > 09:29:32.252358201 -0400 EDT m=+0.039016913 container exec
> > 2149e16fa2ce8769bf3be9e6e25eec61b8e027b0e8699f1cb7d5f113fc4aac66
> > (image=quay.io/prometheus/node-exporter:v1.5.0,
> > name=ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f-node-exporter-dell02,
> > maintainer=The Prometheus Authors
> > <prometheus-develop...@googlegroups.com>)
> > Apr 21 09:29:32 dell02.mousetech.com podman[997331]: 2025-04-21
> > 09:29:32.259543374 -0400 EDT m=+0.046202087 container exec_died
> > 2149e16fa2ce8769bf3be9e6e25eec61b8e027b0e8699f1cb7d5f113fc4aac66
> > (image=quay.io/prometheus/node-exporter:v1.5.0,
> > name=ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f-node-exporter-dell02,
> > maintainer=The Prometheus Authors
> > <prometheus-develop...@googlegroups.com>)
> > Apr 21 09:29:32 dell02.mousetech.com podman[997518]: 2025-04-21
> > 09:29:32.947995363 -0400 EDT m=+0.036761633 container exec
> > 71c75380a0b63fa8a31fa296c733d8385309b44325813450fcd30670b249157c
> > (image=quay.io/prometheus/prometheus:v2.43.0,
> > name=ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f-prometheus-dell02,
> > maintainer=The Prometheus Authors
> > <prometheus-develop...@googlegroups.com>)
> > Apr 21 09:29:32 dell02.mousetech.com podman[997518]: 2025-04-21
> > 09:29:32.979516297 -0400 EDT m=+0.068282565 container exec_died
> > 71c75380a0b63fa8a31fa296c733d8385309b44325813450fcd30670b249157c
> > (image=quay.io/prometheus/prometheus:v2.43.0,
> > name=ceph-278fcd86-0861-11ee-a7df-9c5c8e86cf8f-prometheus-dell02,
> > maintainer=The Prometheus Authors
> > <prometheus-develop...@googlegroups.com>)
> >
> > It looks like a process is failing inside the prometheus container,
> > but the container itself remains in operation.
> >
> > On another note, I was all green until 2 days ago when a node went
> > into overload. The node has recovered, but the prometheus alert is
> > back and has been for 2 days.
> >
> >    Tim
> >
> >
> > On 4/21/25 07:25, Ernesto Puerta wrote:
> >> You could check the Alertmanager container logs:
> >> https://docs.ceph.com/en/quincy/cephadm/operations/#example-of-logging-to-journald
> >>
> >> Kind Regards,
> >> Ernesto
> >>
> >>
> >> On Wed, Apr 16, 2025 at 4:54 PM Tim Holloway <t...@mousetech.com> wrote:
> >>
> >>> I'm thinking it's more likely some sort of latency error.
> >>>
> >>> I have 2 prometheus daemons running at the moment. The hosts files on
> >>> all my ceph servers contain both hostname and FQDN.
> >>>
> >>> This morning the alert was gone. I don't know where I might find a log
> >>> of when it comes and goes, but all was clean, then it wasn't, and now it's
> >>> clean again, even though I haven't been playing with any configurations
> >>> or bouncing hosts or services. It's just appearing and disappearing.
> >>>
> >>>      Tim
> >>>
> >>> On 4/16/25 09:34, Ankush Behl wrote:
> >>>> Just to add to what Ernesto mentioned: your prometheus container might
> >>>> not be able to reach the ceph scrape target, since it may be addressed
> >>>> by an FQDN or hostname that doesn't resolve from there. Try updating
> >>>> /etc/hosts with the IP and hostname of the ceph scrape target (you can
> >>>> find it in the Prometheus UI -> Status -> Targets); restarting
> >>>> Prometheus after that might help resolve the issue.
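> >>>>
> >>>> If you'd rather check from a shell than the UI, the Prometheus HTTP API
> >>>> exposes the same target list. A minimal Python sketch, assuming the API
> >>>> is reachable at the host and port below (9095 is the usual cephadm
> >>>> default; stock Prometheus uses 9090), so adjust the URL to your deployment:
> >>>>
> >>>>     import json
> >>>>     import urllib.request
> >>>>
> >>>>     PROM = "http://dell02.mousetech.com:9095"  # adjust to your setup
> >>>>
> >>>>     with urllib.request.urlopen(f"{PROM}/api/v1/targets") as resp:
> >>>>         data = json.load(resp)
> >>>>
> >>>>     # Print every target of the "ceph" job with its health and last error.
> >>>>     for tgt in data["data"]["activeTargets"]:
> >>>>         if tgt["labels"].get("job") == "ceph":
> >>>>             print(tgt["scrapeUrl"], tgt["health"], tgt.get("lastError", ""))
> >>>>
> >>>> A target reported as "down" here, together with its lastError, usually
> >>>> tells you whether the problem is name resolution, routing, or the
> >>>> endpoint itself.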
> >>>>
> >>>> On Wed, Apr 16, 2025 at 2:10 PM Ernesto Puerta <epuer...@redhat.com> wrote:
> >>>>> Don't shoot the messenger. The Dashboard is just displaying the alert
> >>>>> that Prometheus/AlertManager is reporting. The alert definition is here:
> >>>>> https://github.com/ceph/ceph/blob/3993779cde9d10512f4a26f87487d11103ac1bd0/monitoring/ceph-mixin/prometheus_alerts.yml#L342-L351
> >>>>> As you can see, it's based on the status of the Prometheus "ceph" scrape
> >>>>> job. This alert is vital, because if the "ceph" job is not scraping metrics
> >>>>> from the "mgr/prometheus" module, no other Ceph alert condition will be
> >>>>> detected, thereby creating a false sense of confidence.
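> >>>>>
> >>>>> If you want to see what the alert sees, you can query the metric the rule
> >>>>> is (roughly) built on, the "up" series for the "ceph" job, directly from
> >>>>> the Prometheus query API. A minimal Python sketch; the host and port are
> >>>>> only examples, so point it at your own Prometheus:
> >>>>>
> >>>>>     import json
> >>>>>     import urllib.parse
> >>>>>     import urllib.request
> >>>>>
> >>>>>     PROM = "http://dell02.mousetech.com:9095"  # example address
> >>>>>     q = urllib.parse.urlencode({"query": 'up{job="ceph"}'})
> >>>>>
> >>>>>     with urllib.request.urlopen(f"{PROM}/api/v1/query?{q}") as resp:
> >>>>>         result = json.load(resp)["data"]["result"]
> >>>>>
> >>>>>     # One sample per scrape target: 1 means the last scrape succeeded,
> >>>>>     # 0 means Prometheus could not reach the mgr/prometheus endpoint.
> >>>>>     for sample in result:
> >>>>>         print(sample["metric"].get("instance"), "up =", sample["value"][1])
> >>>>>
> >>>>> If that returns 0 (or nothing at all) for the mgr endpoint, the alert is
> >>>>> behaving as designed and the problem is somewhere on the scraping path.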
> >>>>>
> >>>>> You might start by having a look at the Prometheus and/or Alertmanager
> >>>>> web UIs, or by checking their logs.
> >>>>>
> >>>>> Kind Regards,
> >>>>> Ernesto
> >>>>>
> >>>>>
> >>>>> On Tue, Apr 15, 2025 at 7:28 PM Tim Holloway <t...@mousetech.com> wrote:
> >>>>>> Although I've had this problem since at least Pacific, I'm still seeing
> >>>>>> it on Reef.
> >>>>>>
> >>>>>> After much pain and suffering (covered elsewhere), I got my Prometheus
> >>>>>> services deployed as intended, Ceph health OK, green across the board.
> >>>>>>
> >>>>>> However, over the weekend, the dreaded "CephMgrPrometheusModuleInactive"
> >>>>>> alert has returned to the Dashboard: "The mgr/prometheus module at
> >>>>>> dell02.mousetech.com:9283 is unreachable."
> >>>>>>
> >>>>>> It's a blatant lie.
> >>>>>>
> >>>>>> I still get "Ceph HEALTH_OK". All monitor status commands show
> >>>>>> everything running. Checking ports on the host says it's listening.
> >>>>>>
> >>>>>> More to the point, I can point my desktop browser at
> >>>>>> http://dell02.mousetech.com:9283 and get a page that lets me see the
> >>>>>> metrics. So everyone can see it but the Dashboard!
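> >>>>>>
> >>>>>> For what it's worth, a browser check from the desktop and a check from
> >>>>>> the Prometheus host itself can give different answers if name resolution
> >>>>>> differs between the two machines. A minimal Python sketch (hostname and
> >>>>>> port taken from the alert text) that prints what the local resolver
> >>>>>> returns and then pulls the metrics page:
> >>>>>>
> >>>>>>     import socket
> >>>>>>     import urllib.request
> >>>>>>
> >>>>>>     host, port = "dell02.mousetech.com", 9283  # from the alert text
> >>>>>>
> >>>>>>     # What does this machine think the name points at?
> >>>>>>     print(host, "resolves to", socket.gethostbyname(host))
> >>>>>>
> >>>>>>     # Can we actually pull metrics from the mgr/prometheus module?
> >>>>>>     url = f"http://{host}:{port}/metrics"
> >>>>>>     with urllib.request.urlopen(url, timeout=5) as resp:
> >>>>>>         print("HTTP", resp.status, "-", len(resp.read()), "bytes of metrics")
> >>>>>>
> >>>>>> Running it on the node that hosts the Prometheus container is the more
> >>>>>> telling test, since that is closer to what the scraper actually sees.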
> >>>>>>
> >>>>>> I did have some issues when the other prometheus host couldn't resolve
> >>>>>> the hostname, but I fixed that for all ceph hosts and it was green for
> >>>>>> days. Now the error is back. Restarting Prometheus didn't help.
> >>>>>>
> >>>>>> How is the Dashboard hallucinating this???
> >>>>>>
> >>>>>>      Tim
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
