I just wanted to follow up and explain how we ended up with each alert being listed twice, which is also what prevented our changes to ceph_alerts.yml from taking effect.

We only had one prometheus service running, and only one PGImbalance rule in the /var/lib/ceph/{FSID}/prometheus.{host}/etc/prometheus/alerting/ceph_alerts.yml file. *However*, before modifying it I had first backed up the original file to /var/lib/ceph/{FSID}/prometheus.{host}/etc/prometheus/alerting/ceph_alerts.yml.bk. Prometheus evidently loads every file in that alerting directory, so the backup was picked up as a second copy of every rule. Once I removed the ceph_alerts.yml.bk file, the dashboard showed only one alert rule, as it should (modified for a deviation of 90%), and all of the “30%” active alerts cleared.
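In case it helps anyone else hitting the same thing, here’s a quick sanity check we now run before restarting. Fill in the {FSID}/{host} placeholders for your own cluster; the grep is just one illustrative way to spot duplicated rules, not anything official:

———
# Prometheus appears to load *every* file in the alerting directory,
# so anything besides ceph_alerts.yml (e.g. a stray .bk copy) becomes
# a second set of rules.
ls /var/lib/ceph/{FSID}/prometheus.{host}/etc/prometheus/alerting/

# Flag alert definitions that appear more than once across those files
# (identical duplicate files show up as repeated "alert:" lines).
grep -h "alert:" /var/lib/ceph/{FSID}/prometheus.{host}/etc/prometheus/alerting/* | sort | uniq -d
———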
So for now, at least until we figure out how to override a given alert using templates, Eugen’s procedure works fine (sketched in shell just below):

1. Modify (but don’t back up or rename) /var/lib/ceph/{FSID}/prometheus.{host}/etc/prometheus/alerting/ceph_alerts.yml
2. Restart prometheus
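In shell form, that amounts to something like the following. This is an untested sketch; the /root backup location is just an example, anywhere outside the alerting directory will do:

———
# 1. Keep any backup OUTSIDE the alerting directory, so Prometheus
#    doesn't load it as a second rule file.
cp /var/lib/ceph/{FSID}/prometheus.{host}/etc/prometheus/alerting/ceph_alerts.yml \
   /root/ceph_alerts.yml.bk

# 2. Edit the rule in place (e.g. the CephPGImbalance threshold).
vi /var/lib/ceph/{FSID}/prometheus.{host}/etc/prometheus/alerting/ceph_alerts.yml

# 3. Restart the prometheus daemon; 'ceph orch ps --daemon-type prometheus'
#    shows the exact daemon name to use.
ceph orch daemon restart prometheus.{host}
———

If the restart succeeds, the dashboard’s Alerts view should show a single CephPGImbalance rule with the new threshold.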
Many thanks to Eugen for their help tracking this down!

Sincerely,
Devin

> On Jan 13, 2025, at 9:55 PM, Devin A. Bougie <devin.bou...@cornell.edu> wrote:
>
> Hi Eugen,
>
> No, as far as I can tell I only have one prometheus service running.
>
> ———
> [root@cephman2 ~]# ceph orch ls prometheus --export
> service_type: prometheus
> service_name: prometheus
> placement:
>   count: 1
>   label: _admin
>
> [root@cephman2 ~]# ceph orch ps --daemon-type prometheus
> NAME                 HOST                         PORTS   STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
> prometheus.cephman2  cephman2.classe.cornell.edu  *:9095  running (12h)  4m ago     3w   350M     -        2.43.0   a07b618ecd1d  5a8d88682c28
> ———
>
> Anything else I can check or do?
>
> Thanks,
> Devin
>
>> On Jan 13, 2025, at 6:39 PM, Eugen Block <ebl...@nde.ag> wrote:
>>
>> Do you have two Prometheus instances? Maybe you could share
>> ceph orch ls prometheus --export
>>
>> Or alternatively:
>> ceph orch ps --daemon-type prometheus
>>
>> You can use two instances for HA, but then you need to change the threshold for both, of course.
>>
>> Quoting "Devin A. Bougie" <devin.bou...@cornell.edu>:
>>
>>> Thanks, Eugen! Just in case you have any more suggestions, this still isn’t quite working for us.
>>>
>>> Perhaps one clue is that in the Alerts view of the cephadm dashboard, every alert is listed twice. We see two CephPGImbalance alerts, both set to 30% after redeploying the service. If I then follow your procedure, one of the alerts updates to 50% as configured, but the other stays at 30%. Is it normal to see each alert listed twice, or did I somehow make a mess of things when trying to change the default alerts?
>>>
>>> No problem if there isn’t an obvious answer; we can live with and ignore the spurious CephPGImbalance alerts.
>>>
>>> Thanks again,
>>> Devin
>>>
>>>> On Jan 7, 2025, at 2:14 AM, Eugen Block <ebl...@nde.ag> wrote:
>>>>
>>>> Hi,
>>>>
>>>> sure thing, here's the diff showing how I changed it to a 50% deviation instead of 30%:
>>>>
>>>> ---snip---
>>>> diff -u /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml.dist
>>>> --- /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml	2024-12-17 10:03:23.540179209 +0100
>>>> +++ /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml.dist	2024-12-17 10:03:00.380883413 +0100
>>>> @@ -237,13 +237,13 @@
>>>>          type: "ceph_default"
>>>>      - alert: "CephPGImbalance"
>>>>        annotations:
>>>> -        description: "OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} deviates by more than 50% from average PG count."
>>>> +        description: "OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} deviates by more than 30% from average PG count."
>>>>          summary: "PGs are not balanced across OSDs"
>>>>        expr: |
>>>>          abs(
>>>>            ((ceph_osd_numpg > 0) - on (job) group_left avg(ceph_osd_numpg > 0) by (job)) /
>>>>            on (job) group_left avg(ceph_osd_numpg > 0) by (job)
>>>> -        ) * on (ceph_daemon) group_left(hostname) ceph_osd_metadata > 0.50
>>>> +        ) * on (ceph_daemon) group_left(hostname) ceph_osd_metadata > 0.30
>>>> ---snip---
>>>>
>>>> Then you restart prometheus ('ceph orch ps --daemon-type prometheus' shows you the exact daemon name):
>>>>
>>>> ceph orch daemon restart prometheus.host1
>>>>
>>>> This will only work until you upgrade prometheus, of course.
>>>>
>>>> Regards,
>>>> Eugen
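As an aside, while Eugen’s expr is on screen: since our underlying issue is two very different device classes, one idea we’re toying with is comparing each OSD against the average PG count for its own device class rather than the global average. This is a completely untested sketch, and it assumes ceph_osd_metadata is a constant-1 series carrying a device_class label (so multiplying by it only attaches labels):

---snip---
abs(
  (
    # attach device_class (and hostname) to each OSD's PG count
    ((ceph_osd_numpg > 0) * on (ceph_daemon) group_left(device_class, hostname) ceph_osd_metadata)
    # subtract the average PG count of that OSD's device class
    - on (device_class) group_left
      avg((ceph_osd_numpg > 0) * on (ceph_daemon) group_left(device_class) ceph_osd_metadata) by (device_class)
  )
  # divide by the same per-device-class average to get relative deviation
  / on (device_class) group_left
    avg((ceph_osd_numpg > 0) * on (ceph_daemon) group_left(device_class) ceph_osd_metadata) by (device_class)
) > 0.30
---snip---

If that pans out, it might address the per-crush_rule question in my original message below without needing templates.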
>>>> Quoting "Devin A. Bougie" <devin.bou...@cornell.edu>:
>>>>
>>>>> Thanks, Eugen. I’m afraid I haven’t yet found a way to either disable the CephPGImbalance alert or change it to handle different OSD sizes. Changing /var/lib/ceph/<cluster_id>/home/ceph_default_alerts.yml doesn’t seem to have any effect, and I haven’t even managed to change the behavior from within the running prometheus container.
>>>>>
>>>>> If you have a functioning workaround, can you give a little more detail on exactly what yaml file you’re changing, and where?
>>>>>
>>>>> Thanks again,
>>>>> Devin
>>>>>
>>>>>> On Dec 30, 2024, at 12:39 PM, Eugen Block <ebl...@nde.ag> wrote:
>>>>>>
>>>>>> Funny, I was going to look next week at how to deal with different OSD sizes, or whether somebody already has a fix for that. My workaround is changing the yaml file for Prometheus as well.
>>>>>>
>>>>>> Quoting "Devin A. Bougie" <devin.bou...@cornell.edu>:
>>>>>>
>>>>>>> Hi, all. We are using cephadm to manage a 19.2.0 cluster on fully-updated AlmaLinux 9 hosts, and would greatly appreciate help modifying or overriding the alert rules in ceph_default_alerts.yml. Is the best option to simply update the /var/lib/ceph/<cluster_id>/home/ceph_default_alerts.yml file?
>>>>>>>
>>>>>>> In particular, we’d like to either disable the CephPGImbalance alert or change it to calculate averages per-pool or per-crush_rule instead of globally, as in [1].
>>>>>>>
>>>>>>> We currently have PG autoscaling enabled, and have two separate crush_rules (one with large spinning disks, one with much smaller NVMe drives). Although I don’t believe it causes any technical issues with our configuration, our dashboard is full of CephPGImbalance alerts that would be nice to clean up without having to create periodic silences.
>>>>>>>
>>>>>>> Any help or suggestions would be greatly appreciated.
>>>>>>>
>>>>>>> Many thanks,
>>>>>>> Devin
>>>>>>>
>>>>>>> [1] https://github.com/rook/rook/discussions/13126#discussioncomment-10043490

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io