Hi Eugen,

No, as far as I can tell I only have one prometheus service running.

———

[root@cephman2 ~]# ceph orch ls prometheus --export

service_type: prometheus

service_name: prometheus

placement:

  count: 1

  label: _admin


[root@cephman2 ~]# ceph orch ps --daemon-type prometheus

NAME                 HOST                         PORTS   STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID

prometheus.cephman2  cephman2.classe.cornell.edu  *:9095  running (12h)  4m ago     3w   350M     -        2.43.0   a07b618ecd1d  5a8d88682c28

———

Anything else I can check or do?

Thanks,
Devin

On Jan 13, 2025, at 6:39 PM, Eugen Block <ebl...@nde.ag> wrote:

Do you have two Prometheus instances? Maybe you could share
ceph orch ls prometheus --export

Or alternatively:
ceph orch ps --daemon-type prometheus

You can use two instances for HA, but then you need to change the threshold for 
both, of course.
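
You can also ask the running Prometheus itself how many copies of the rule it has loaded via its HTTP rules API. A rough sketch (host1:9095 is just a placeholder, use the host and port that 'ceph orch ps' shows for your prometheus daemon):

---snip---
# count how many CephPGImbalance rule definitions the running Prometheus reports
curl -s http://host1:9095/api/v1/rules | grep -o '"name":"CephPGImbalance"' | wc -l
---snip---

If that prints 2 with only a single prometheus daemon running, the rule is most likely defined twice in the rule files that instance loads.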

Quoting "Devin A. Bougie" <devin.bou...@cornell.edu>:

Thanks, Eugen!  Just in case you have any more suggestions: this still isn't quite working for us.

Perhaps one clue is that in the Alerts view of the cephadm dashboard, every 
alert is listed twice.  We see two CephPGImbalance alerts, both set to 30% 
after redeploying the service.  If I then follow your procedure, one of the 
alerts updates to 50% as configured, but the other stays at 30.  Is it normal 
to see each alert listed twice, or did I somehow make a mess of things when 
trying to change the default alerts?

No problem if it’s not an obvious answer, we can live with and ignore the 
spurious CephPGImbalance alerts.

Thanks again,
Devin

On Jan 7, 2025, at 2:14 AM, Eugen Block <ebl...@nde.ag> wrote:

Hi,

sure thing, here's the diff showing how I changed it to a 50% deviation instead of 30% (my edited file is on the '-' side, the shipped .dist file with the 30% default is on the '+' side):

---snip---
diff -u /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml.dist
--- /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml       2024-12-17 10:03:23.540179209 +0100
+++ /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml.dist  2024-12-17 10:03:00.380883413 +0100
@@ -237,13 +237,13 @@
         type: "ceph_default"
     - alert: "CephPGImbalance"
       annotations:
-          description: "OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} deviates by more than 50% from average PG count."
+          description: "OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} deviates by more than 30% from average PG count."
         summary: "PGs are not balanced across OSDs"
       expr: |
         abs(
          ((ceph_osd_numpg > 0) - on (job) group_left avg(ceph_osd_numpg > 0) by (job)) /
           on (job) group_left avg(ceph_osd_numpg > 0) by (job)
-          ) * on (ceph_daemon) group_left(hostname) ceph_osd_metadata > 0.50
+          ) * on (ceph_daemon) group_left(hostname) ceph_osd_metadata > 0.30
---snip---

Then you restart prometheus ('ceph orch ps --daemon-type prometheus' shows you 
the exact daemon name):

ceph orch daemon restart prometheus.host1

This will only work until you upgrade prometheus, of course.
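
A quick way to confirm the edit took effect, and to notice when an upgrade or redeploy has reverted it, is to grep the threshold straight out of the active rule file (same path as in the diff above):

---snip---
# show the comparison at the end of the CephPGImbalance expression; should read 0.50 after the change
grep -n 'ceph_osd_metadata > 0' /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml
---snip---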

Regards,
Eugen


Quoting "Devin A. Bougie" <devin.bou...@cornell.edu>:

Thanks, Eugen.  I’m afraid I haven’t yet found a way to either disable the 
CephPGImbalance alert or change it to handle different OSD sizes.  Changing 
/var/lib/ceph/<cluster_id>/home/ceph_default_alerts.yml doesn’t seem to have 
any effect, and I haven’t even managed to change the behavior from within the 
running prometheus container.

If you have a functioning workaround, can you give a little more detail on 
exactly what yaml file you’re changing and where?

Thanks again,
Devin

On Dec 30, 2024, at 12:39 PM, Eugen Block <ebl...@nde.ag> wrote:

Funny, I was planning to take a look next week at how to deal with different OSD sizes, or to see if somebody already has a fix for that. My workaround is changing the yaml file for Prometheus as well.

Quoting "Devin A. Bougie" <devin.bou...@cornell.edu>:

Hi, All.  We are using cephadm to manage a 19.2.0 cluster on fully-updated 
AlmaLinux 9 hosts, and would greatly appreciate help modifying or overriding 
the alert rules in ceph_default_alerts.yml.  Is the best option to simply 
update the /var/lib/ceph/<cluster_id>/home/ceph_default_alerts.yml file?

In particular, we’d like to either disable the CephPGImbalance alert or change 
it to calculate averages per-pool or per-crush_rule instead of globally as in 
[1].
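
Conceptually, something along the lines of the expression below is what we have in mind (an untested sketch based on [1]; it assumes ceph_osd_metadata exposes a device_class label, so each OSD is compared against the average PG count of its own device class rather than the global average):

---snip---
# per-device-class variant of the CephPGImbalance expression (sketch, not verified)
abs(
  (
    ((ceph_osd_numpg > 0) * on (ceph_daemon) group_left(device_class, hostname) ceph_osd_metadata)
    - on (job, device_class) group_left
    avg by (job, device_class) ((ceph_osd_numpg > 0) * on (ceph_daemon) group_left(device_class) ceph_osd_metadata)
  )
  / on (job, device_class) group_left
  avg by (job, device_class) ((ceph_osd_numpg > 0) * on (ceph_daemon) group_left(device_class) ceph_osd_metadata)
) > 0.30
---snip---

That way each HDD OSD would only be compared against other HDDs and each NVMe OSD against other NVMes, which is roughly the per-crush_rule behaviour we are after (assuming our two rules select by device class).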

We currently have PG autoscaling enabled, and have two separate crush_rules 
(one with large spinning disks, one with much smaller nvme drives).  Although I 
don’t believe it causes any technical issues with our configuration, our 
dashboard is full of CephPGImbalance alerts that would be nice to clean up 
without having to create periodic silences.

Any help or suggestions would be greatly appreciated.

Many thanks,
Devin

[1] https://github.com/rook/rook/discussions/13126#discussioncomment-10043490
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io