Hey Brian

On Tuesday, April 25, 2023 at 9:59:12 AM UTC+2 Brian Candler wrote:

So really I'd divide the possibilities 3 ways:

a. Prevent the alert being generated from prometheus in the first place, by 
writing the expr in such a way that it filters out conditions that you 
don't want to alert on

b. Let the alert arrive at alertmanager, but permanently prevent it from 
sending out notifications for certain instances

c. Apply a temporary silence in alertmanager for certain alerts or groups 
of alerts

(a) is done by writing your 'expr' to match only specific instances, or to 
exclude specific instances

(b) is done by matching on labels in your alertmanager routing rules (and 
if necessary, by adding extra labels in your 'expr')
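
A minimal sketch of option (b): route alerts carrying a marker label into a 
receiver that has no notification configs. The label name (`foreign`) and the 
receiver names here are my own invention, not anything from your setup:

```yaml
# alertmanager.yml (sketch) -- permanently drop notifications for alerts
# that carry foreign="true"; label and receiver names are assumptions
route:
  receiver: default
  routes:
    - matchers:
        - foreign = "true"
      receiver: blackhole      # receiver with no notification configs

receivers:
  - name: default
    # ...your real notification settings here...
  - name: blackhole            # defined, but sends nothing
```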


I think in my case (where I want to simply get no alerts at all for a 
certain group of instances) it would be (a) or (b), with (a) probably being 
the cleaner one.

I guess with (b) you also meant having a route which is then permanently 
muted?


If you want to apply a threshold to only certain filesystems, and/or to 
have different thresholds per filesystem, then it's possible to put the 
thresholds in their own set of static timeseries:

https://www.robustperception.io/using-time-series-as-alert-thresholds

But I don't recommend this: I find such alerts brittle.
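
For reference, the pattern from that article looks roughly like this (the 
metric names `disk_full_threshold_percent` and `disk_used_percent` are my 
own invention): per-instance thresholds recorded as their own timeseries, 
with the alert comparing against them:

```yaml
groups:
  - name: thresholds
    rules:
      # hypothetical static per-instance thresholds, stored as timeseries
      - record: disk_full_threshold_percent
        expr: vector(85)
        labels:
          instance: host1:9100
      - record: disk_full_threshold_percent
        expr: vector(90)
        labels:
          instance: host2:9100
  - name: alerts
    rules:
      - alert: DiskUsageOverThreshold
        expr: |
          disk_used_percent
            > on(instance) group_left()
          disk_full_threshold_percent
```

The brittleness Brian mentions shows up here: every new instance needs its 
own recording rule, and instances without one silently get no alert at all.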


That also sounds a bit over-engineered to me.

It helps to rethink exactly what you should be alerting on:
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit

For the majority of cases: "alert on symptoms, rather than causes".  That 
is, alert when a service isn't *working* (which you always need to know 
about), and in those alerts you can include potential cause-based 
information (e.g. CPU load is high, RAM is full, database is down etc).

Now, there are also some things you want to know about *before* they become 
a problem, like "disk is nearly full".  But the trouble with static 
thresholds is that they are a pain to manage.  Suppose you have a threshold 
at 85%, and one server which is consistently at 86% but not growing - you 
know this is the case and have no need to grow the filesystem, so you end 
up tweaking thresholds per instance.

I would suggest two alternatives:

1. Check dashboards daily.  If you want automatic notifications, then don't 
send the sort of alert which gets someone out of bed, but an "FYI" 
notification to something like Slack or Teams.

2. Write dynamic alerts, e.g. have alerting rules which identify disk usage 
which is growing rapidly and likely to fill in the next few hours or days.

- name: DiskRate10m
  interval: 1m
  rules:
  # Warn if the rate of growth over the last 10 minutes means the
  # filesystem will fill in 2 hours
  - alert: DiskFilling10m
    expr: |
        node_filesystem_avail_bytes / (node_filesystem_avail_bytes -
        (predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[10m], 7200) < 0)) * 7200
    for: 20m
    labels:
      severity: critical
    annotations:
      summary: 'Filesystem will be full in {{ $value | humanizeDuration }} at current 10m growth rate'

- name: DiskRate3h
  interval: 10m
  rules:
  # Warn if the rate of growth over the last 3 hours means the filesystem
  # will fill in 2 days
  - alert: DiskFilling3h
    expr: |
        node_filesystem_avail_bytes / (node_filesystem_avail_bytes -
        (predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[3h], 172800) < 0)) * 172800
    for: 6h
    labels:
      severity: warning
    annotations:
      summary: 'Filesystem will be full in {{ $value | humanizeDuration }} at current 3h growth rate'
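
The arithmetic in those exprs can be read as: predict_linear() projects the 
available bytes `horizon` seconds ahead; the `< 0` filter keeps only 
filesystems that would actually go negative (i.e. fill up), and the 
surrounding division rescales that into "seconds until zero bytes free". A 
minimal Python sketch of the same calculation (function and variable names 
are mine, not from the rule file):

```python
def seconds_until_full(avail_now, avail_predicted, horizon=7200):
    """Seconds until available bytes hit zero, given the current value and
    the value predict_linear() projects `horizon` seconds ahead."""
    # predict_linear's slope is (avail_predicted - avail_now) / horizon;
    # the filesystem fills when available bytes reach zero, so:
    #   time_to_full = avail_now / -slope
    #                = avail_now / (avail_now - avail_predicted) * horizon
    return avail_now / (avail_now - avail_predicted) * horizon

# e.g. 10 GiB free now, projected at -10 GiB in 2 hours -> full in 1 hour
print(seconds_until_full(10 * 2**30, -10 * 2**30))  # 3600.0
```

That value is what lands in {{ $value }} and gets passed through 
humanizeDuration in the summary annotation.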


Thanks, but I'm not sure whether the above applies to my scenario.

For me it's really like this:
My Prometheus instance monitors:
- my "own" instances, where I need to react to things like >85% usage on 
the root filesystem (and thus want to get an alert)
- "foreign" instances, where I just get the node exporter data and show 
e.g. CPU usage, IO usage, and so on as a convenience to users of our 
cluster - but any alert conditions wouldn't cause any further action on my 
side (and the guys in charge of those servers have their own monitoring)

So in the end it just boils down to my desire to keep my alert rules 
small/simple/readable.
   expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} * 100)
        / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"}) >= 85
=> would fire for all nodes, bad

   expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRightHosts"} * 100)
        / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRightHosts"}) >= 85
=> would work, I guess, but seems really ugly to read/maintain


 

Not sure whether anything can be done better via adding labels at some 
stage.


As well as target labels, you can set labels in the alerting rules 
themselves, for when an alert fires. That doesn't help you filter the alert 
expr itself, but it can be useful when deciding how to route the 
notification in alertmanager.


Those (target labels) are the ones that would get saved in the TSDB, right?
 

Target labels are a decent way to classify machines, e.g. target labels for 
"development" and "production" mean that you can easily alert or dispatch 
alerts differently for those two environments.  But you should beware of 
changing them frequently, because every time the set of labels on a metric 
changes, it becomes a new timeseries.  This makes it hard to follow the 
history of the metric.


Which is why I would rather not want to use them (for that purpose).

 

If you want to do really clever stuff like classifying hosts dynamically, 
then you can do it by having *separate* timeseries for those 
classifications:

https://www.robustperception.io/how-to-have-labels-for-machine-roles
https://www.robustperception.io/exposing-the-software-version-to-prometheus
https://www.robustperception.io/left-joins-in-promql
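
For illustration, the left-join style from those articles might look like 
this (assuming a hand-maintained classification metric along the lines of 
`machine_role{role="own", instance="..."} 1`, which does not exist unless 
you export it yourself):

```yaml
# sketch: restrict the disk alert to instances classified as "own"
- alert: RootFilesystemFull
  expr: |
    (100 - node_filesystem_avail_bytes{mountpoint="/"} * 100
         / node_filesystem_size_bytes{mountpoint="/"}) >= 85
    and on(instance) machine_role{role="own"}
```

Here `and on(instance)` acts as a filter: only series whose instance also 
appears in the machine_role selector survive.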

Again, unless you really need it, this is arguably getting "too clever" - 
and it will make the actual alerting rules more complex.


Which (making the rules complex) is just what I want to avoid... plus, new 
time series means more storage usage.


From all that it seems to me that the "best" solution is either:
a) simply making more complex and error-prone alert rules that filter out 
the instances in the first place, like in:
   expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRightHosts"} * 100)
        / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRightHosts"}) >= 85

b) the idea that I had above:
- using <alert_relabel_configs> to filter on the instances and add a label 
if it should be silenced
- use only that label in the expr instead of the full regex
But would that even work?
Because the documentation says "Alert relabeling is applied to alerts 
before they are sent to the Alertmanager."... but the alert rules are 
already evaluated before that, right?
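
For what it's worth, dropping alerts via relabeling would look roughly like 
this (the `foreign-.*` naming scheme is invented). One caveat: alert 
relabeling runs after rule evaluation, so the rules are still evaluated and 
the alerts still show as firing in the Prometheus UI; `action: drop` only 
stops them from being sent to Alertmanager:

```yaml
# prometheus.yml (sketch)
alerting:
  alert_relabel_configs:
    - source_labels: [instance]
      regex: 'foreign-.*'      # hypothetical naming for the foreign hosts
      action: drop
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```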


Thanks,
Chris.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/92559a54-5e1d-4892-9574-e3cb8fc89a70n%40googlegroups.com.
