Hey Brian,

On Tuesday, April 25, 2023 at 9:59:12 AM UTC+2 Brian Candler wrote:
> So really I'd divide the possibilities 3 ways:
>
> a. Prevent the alert being generated by Prometheus in the first place, by writing the expr in such a way that it filters out conditions that you don't want to alert on
> b. Let the alert arrive at Alertmanager, but permanently prevent it from sending out notifications for certain instances
> c. Apply a temporary silence in Alertmanager for certain alerts or groups of alerts
>
> (a) is done by writing your 'expr' to match only specific instances, or to exclude specific instances. (b) is done by matching on labels in your Alertmanager routing rules (and if necessary, by adding extra labels in your 'expr').

I think in my case (where I simply want no alerts at all for a certain group of instances) it would be (a) or (b), with (a) probably being the cleaner one. I guess with (b) you also meant having a route which is then permanently muted?

> If you want to apply a threshold to only certain filesystems, and/or to have different thresholds per filesystem, then it's possible to put the thresholds in their own set of static timeseries:
> https://www.robustperception.io/using-time-series-as-alert-thresholds
> But I don't recommend this, and I find such alerts are brittle.

That also sounds like a solution that's a bit over-engineered to me.

> It helps to rethink exactly what you should be alerting on:
> https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
> For the majority of cases: "alert on symptoms, rather than causes". That is, alert when a service isn't *working* (which you always need to know about), and in those alerts you can include potential cause-based information (e.g. CPU load is high, RAM is full, database is down, etc).
> Now, there are also some things you want to know about *before* they become a problem, like "disk is nearly full". But the trouble with static alerts is, they are a pain to manage.
> Suppose you have a threshold at 85%, and you have one server which is consistently at 86% but not growing - you know this is the case, you have no need to grow the filesystem, so you end up tweaking thresholds per instance. I would suggest two alternatives:
>
> 1. Check dashboards daily. If you want automatic notifications, then don't send the sort of alert which gets someone out of bed, but an "FYI" notification to something like Slack or Teams.
> 2. Write dynamic alerts, e.g. have alerting rules which identify disk usage which is growing rapidly and likely to fill in the next few hours or days:
>
>   - name: DiskRate10m
>     interval: 1m
>     rules:
>       # Warn if rate of growth over last 10 minutes means filesystem will fill in 2 hours
>       - alert: DiskFilling10m
>         expr: |
>           node_filesystem_avail_bytes
>             / (node_filesystem_avail_bytes
>                - (predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[10m], 7200) < 0))
>             * 7200
>         for: 20m
>         labels:
>           severity: critical
>         annotations:
>           summary: 'Filesystem will be full in {{ $value | humanizeDuration }} at current 10m growth rate'
>
>   - name: DiskRate3h
>     interval: 10m
>     rules:
>       # Warn if rate of growth over last 3 hours means filesystem will fill in 2 days
>       - alert: DiskFilling3h
>         expr: |
>           node_filesystem_avail_bytes
>             / (node_filesystem_avail_bytes
>                - (predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[3h], 172800) < 0))
>             * 172800
>         for: 6h
>         labels:
>           severity: warning
>         annotations:
>           summary: 'Filesystem will be full in {{ $value | humanizeDuration }} at current 3h growth rate'

Thanks, but I'm not sure whether the above applies to my scenario. For me it's really like this. My Prometheus instance monitors:

- my "own" instances, where I need to react to things like >85% usage on the root filesystem (and thus want to get an alert)
- "foreign" instances, where I just collect the node exporter data and show e.g. CPU usage, IO usage, and so on as a convenience to users of our cluster - but any alert conditions wouldn't cause any further action on my side (and the people in charge of those servers have their own monitoring)

So in the end it just boils down to my desire to keep my alert rules small/simple/readable.

expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} * 100) / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"}) >= 85
=> would fire for all nodes, bad

expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRightHosts"} * 100) / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRightHosts"}) >= 85
=> would work, I guess, but seems really ugly to read/maintain

Not sure whether anything can be done better by adding labels at some stage.

> As well as target labels, you can set labels in the alerting rules themselves, for when an alert fires. That doesn't help you filter the alert expr itself, but it can be useful when deciding how to route the notification in Alertmanager.

Those (target labels) are the ones that would get saved in the TSDB, right?

> Target labels are a decent way to classify machines, e.g. target labels for "development" and "production" mean that you can easily alert or dispatch alerts differently for those two environments. But you should beware of changing them frequently, because every time the set of labels on a metric changes, it becomes a new timeseries. This makes it hard to follow the history of the metric.

Which is why I would rather not use them (for that purpose).
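For what it's worth, one way to keep the expr readable without the instance regex would be to classify the targets once at scrape time - a minimal sketch, assuming a static_configs setup; the hostnames and the label name "owner" are placeholders I made up, not anything from the thread:

```yaml
scrape_configs:
  - job_name: 'node'
    static_configs:
      # "own" machines we actually alert on (hostnames are placeholders)
      - targets: ['own-host1:9100', 'own-host2:9100']
        labels:
          owner: 'us'
      # "foreign" machines we only graph as a convenience
      - targets: ['foreign-host1:9100']
        labels:
          owner: 'external'
```

The alert rule would then only need one extra matcher instead of the regex, e.g. {mountpoint="/",fstype!="rootfs",owner="us"}. The trade-off is exactly the one discussed above: this is a target label, it is stored in the TSDB on every scraped series, and changing it later starts new timeseries.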
> If you want to do really clever stuff like classifying hosts dynamically, then you can do it by having *separate* timeseries for those classifications:
> https://www.robustperception.io/how-to-have-labels-for-machine-roles
> https://www.robustperception.io/exposing-the-software-version-to-prometheus
> https://www.robustperception.io/left-joins-in-promql
> Again, unless you really need it, this is arguably getting "too clever" - and it will make the actual alerting rules more complex.

Which (making the rules complex) is just what I want to avoid... plus, new timeseries mean more storage usage.

From all that, it seems to me that the "best" solution is either:

a) simply writing more complex and error-prone alert rules that filter out the instances in the first place, like:

expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRightHosts"} * 100) / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRightHosts"}) >= 85

b) the idea that I had above:

- using <alert_relabel_configs> to filter on the instances and add a label if the alert should be silenced
- using only that label in the expr instead of the full regex

But would that even work? Because the documentation says "Alert relabeling is applied to alerts before they are sent to the Alertmanager."... but the alert rules are already evaluated before that, right?

Thanks,
Chris
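[On the closing question: as the quoted documentation says, alert_relabel_configs runs after rule evaluation, just before alerts are shipped to Alertmanager - so it cannot feed a label back into the expr, but it can drop or relabel outgoing alerts. A rough sketch of that, assuming the "foreign" hosts are distinguishable by instance name; the regex and the Alertmanager target are placeholders:]

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
  alert_relabel_configs:
    # Drop outgoing alerts for "foreign" hosts entirely.
    # Note: the alert still evaluates and shows as firing in the
    # Prometheus UI; only the notification to Alertmanager is suppressed.
    - source_labels: [instance]
      regex: 'foreign-.*:9100'
      action: drop
```

[This is effectively option (b), not (a): the rule still fires inside Prometheus, but nothing reaches Alertmanager for those instances.]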

