On Tuesday, 25 April 2023 at 02:59:54 UTC+1 Christoph Anton Mitterer wrote:

In principle I'd like to do two things: 

a) have certain alert rules run only for certain instances 
(though that may in practice actually be less needed, if only the 
respective nodes generate the respective metrics - not sure 
yet whether this will be the case) 
b) silence certain (or all) alerts for a given set of instances 
e.g. these may be nodes where I'm not an admin who can take action 
on an incident, but can only view the time series graphs to see what's 
going on


"Silence" has a special meaning in Prometheus: it means a temporary 
override to sending out alerts in alertmanager (typically for maintenance 
periods).

So really I'd divide the possibilities three ways:

a. Prevent the alert being generated from prometheus in the first place, by 
writing the expr in such a way that it filters out conditions that you 
don't want to alert on

b. Let the alert arrive at alertmanager, but permanently prevent it from 
sending out notifications for certain instances

c. Apply a temporary silence in alertmanager for certain alerts or groups 
of alerts

(a) is done by writing your 'expr' to match only specific instances, or to 
exclude specific instances

(b) is done by matching on labels in your alertmanager routing rules (and 
if necessary, by adding extra labels in your 'expr')

(c) is done by creating a silence in alertmanager through its UI or API (or 
a frontend like karma or alerta.io)
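
As a rough sketch of (a) and (b) - the `viewonly-.*` instance pattern and the 
receiver names here are purely illustrative, not taken from your setup:

```yaml
# (a) Filter in the alerting expression itself: only alert for instances
# you administer, by excluding a (hypothetical) instance naming pattern.
groups:
- name: node_alerts
  rules:
  - alert: node_free_fs_space
    expr: |
      100 - (node_filesystem_avail_bytes{mountpoint="/",instance!~"viewonly-.*"} * 100
             / node_filesystem_size_bytes{mountpoint="/",instance!~"viewonly-.*"}) >= 85

# (b) Or let the alert fire, but route it in alertmanager.yml to a receiver
# that has no notification configs, which permanently discards it.
route:
  receiver: default
  routes:
  - matchers:
    - instance =~ "viewonly-.*"
    receiver: blackhole
receivers:
- name: default
  # normal notification integration here
- name: blackhole   # no *_configs: notifications are dropped
```

(These are two separate files - a rule file and alertmanager.yml - shown 
together for brevity.) For (c), silences are created at runtime through the 
alertmanager UI or API rather than in configuration.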


 


As an example I'll take an alert that fires when the root fs has >85% 
usage: 

groups: 
- name: node_alerts 
  rules: 
  - alert: node_free_fs_space 
    expr: |
      100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} * 100)
             / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"}) >= 85



With respect to (a): 
I could of course add yet another matcher like: 
instance=~"someRegexThatDescribesMyInstances" 
to each time series selector, but when that regex gets more complex, 
everything becomes quite unreadable, and it's quite error-prone to forget 
about a place (assuming one has many alerts) when the regex changes.


If you want to apply a threshold to only certain filesystems, and/or to 
have different thresholds per filesystem, then it's possible to put the 
thresholds in their own set of static timeseries:

https://www.robustperception.io/using-time-series-as-alert-thresholds
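
For completeness, a minimal sketch of that approach (the metric name 
fs_used_percent_threshold and the threshold values are illustrative, 
following the article's idea of publishing thresholds via recording rules):

```yaml
groups:
- name: thresholds
  rules:
  # One constant series per mountpoint, created with vector() plus rule labels.
  - record: fs_used_percent_threshold
    expr: vector(85)
    labels:
      mountpoint: /
  - record: fs_used_percent_threshold
    expr: vector(95)
    labels:
      mountpoint: /var/log
- name: alerts
  rules:
  # group_left: many (instance, device) series share one threshold series
  # per mountpoint.
  - alert: node_free_fs_space
    expr: |
      100 * (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes)
        > on(mountpoint) group_left fs_used_percent_threshold
```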

But I don't recommend this; I find such alerts brittle.  It helps 
to rethink exactly what you should be alerting on:
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit

For the majority of cases: "alert on symptoms, rather than causes".  That 
is, alert when a service isn't *working* (which you always need to know 
about), and in those alerts you can include potential cause-based 
information (e.g. CPU load is high, RAM is full, database is down etc).

Now, there are also some things you want to know about *before* they become 
a problem, like "disk is nearly full".  But the trouble with static alerts 
is, they are a pain to manage.  Suppose you have a threshold at 85%, and 
you have one server which is consistently at 86% but not growing - you know 
this is the case, you have no need to grow the filesystem, so you end up 
tweaking thresholds per instance.

I would suggest two alternatives:

1. Check dashboards daily.  If you want automatic notifications, then don't 
send the sort of alert which gets someone out of bed, but an "FYI" 
notification to something like Slack or Teams.
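
A hedged sketch of such an "FYI" route (the receiver and channel names are 
made up, and a global slack_api_url is assumed to be configured elsewhere):

```yaml
route:
  receiver: pager
  routes:
  - matchers:
    - severity = "info"
    receiver: slack-fyi
    repeat_interval: 24h   # at most one repeat notification per day
receivers:
- name: pager
  # paging integration for the alerts that get someone out of bed
- name: slack-fyi
  slack_configs:
  - channel: '#monitoring-fyi'   # illustrative channel name
    send_resolved: true
```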

2. Write dynamic alerts, e.g. have alerting rules which identify disk usage 
which is growing rapidly and likely to fill in the next few hours or days.

- name: DiskRate10m
  interval: 1m
  rules:
  # Warn if rate of growth over the last 10 minutes means the filesystem
  # will fill in 2 hours
  - alert: DiskFilling10m
    expr: |
        node_filesystem_avail_bytes / (node_filesystem_avail_bytes -
        (predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[10m],
        7200) < 0)) * 7200
    for: 20m
    labels:
      severity: critical
    annotations:
      summary: 'Filesystem will be full in {{ $value | humanizeDuration }} at current 10m growth rate'

- name: DiskRate3h
  interval: 10m
  rules:
  # Warn if rate of growth over the last 3 hours means the filesystem
  # will fill in 2 days
  - alert: DiskFilling3h
    expr: |
        node_filesystem_avail_bytes / (node_filesystem_avail_bytes -
        (predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[3h],
        172800) < 0)) * 172800
    for: 6h
    labels:
      severity: warning
    annotations:
      summary: 'Filesystem will be full in {{ $value | humanizeDuration }} at current 3h growth rate'


Not sure whether anything can be done better by adding labels at some 
stage.


As well as target labels, you can set labels in the alerting rules 
themselves, for when an alert fires. That doesn't help you filter the alert 
expr itself, but it can be useful when deciding how to route the 
notification in alertmanager.
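
For example (the `team` label value and receiver name are hypothetical), a 
label set in the rule can then be matched by a route:

```yaml
# In the rule file: labels attached to the alert when it fires
- alert: node_free_fs_space
  expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.15
  labels:
    severity: warning
    team: storage    # illustrative label, used only for routing

# In alertmanager.yml: dispatch on that label
route:
  routes:
  - matchers:
    - team = "storage"
    receiver: storage-team
```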

Target labels are a decent way to classify machines, e.g. target labels for 
"development" and "production" mean that you can easily alert or dispatch 
alerts differently for those two environments.  But you should beware of 
changing them frequently, because every time the set of labels on a metric 
changes, it becomes a new timeseries.  This makes it hard to follow the 
history of the metric.
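
Target labels of that kind are typically attached in the scrape config; a 
sketch with invented hostnames:

```yaml
scrape_configs:
- job_name: node
  static_configs:
  - targets: ['prod-web-1:9100', 'prod-web-2:9100']
    labels:
      env: production
  - targets: ['dev-web-1:9100']
    labels:
      env: development
```

Every metric scraped from those targets then carries env="production" or 
env="development", which both alert exprs and alertmanager routes can match 
on.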

If you want to do really clever stuff like classifying hosts dynamically, 
then you can do it by having *separate* timeseries for those 
classifications:

https://www.robustperception.io/how-to-have-labels-for-machine-roles
https://www.robustperception.io/exposing-the-software-version-to-prometheus
https://www.robustperception.io/left-joins-in-promql
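
The pattern from those articles, roughly (machine_role is a hypothetical 
info-style metric, exposed as a constant 1 per host with a role label):

```yaml
# machine_role{role="db", instance="..."} 1 is exposed by each host;
# group_left(role) copies the role label onto the load metric at query time,
# and also restricts the alert to hosts classified as "db".
- alert: HighLoadOnDBHosts
  expr: |
    node_load5
      * on(instance) group_left(role) machine_role{role="db"}
      > 8
```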

Again, unless you really need it, this is arguably getting "too clever" - 
and it will make the actual alerting rules more complex.
