On Tuesday, 25 April 2023 at 02:59:54 UTC+1 Christoph Anton Mitterer wrote:
In principle I'd like to do two things:
a) have certain alert rules run only for certain instances
(though that may in practice actually be less needed, if only the
respective nodes generate the respective metrics - not sure yet
whether this will be the case)
b) silence certain (or all) alerts for a given set of instances
e.g. these may be nodes where I'm not an admin who can take action
on an incident, but just view the time series graphs to see what's
going on
"Silence" has a special meaning in Prometheus: it means a temporary
override to sending out alerts in alertmanager (typically for maintenance
periods).
So really I'd divide the possibilities 3 ways:
a. Prevent the alert being generated from prometheus in the first place, by
writing the expr in such a way that it filters out conditions that you
don't want to alert on
b. Let the alert arrive at alertmanager, but permanently prevent it from
sending out notifications for certain instances
c. Apply a temporary silence in alertmanager for certain alerts or groups
of alerts
(a) is done by writing your 'expr' to match only specific instances, or to
exclude specific instances
(b) is done by matching on labels in your alertmanager routing rules (and
if necessary, by adding extra labels in your 'expr')
(c) is done by creating a silence in alertmanager through its UI or API (or
a frontend like karma or alerta.io)
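A minimal sketch of (b), assuming Alertmanager >= 0.22 (for the `matchers`
syntax); the receiver names and the instance regex are placeholders:

```yaml
# alertmanager.yml (fragment)
route:
  receiver: ops-team
  routes:
    # Alerts from view-only instances match this child route and go to a
    # receiver with no notification config, so nothing is ever sent.
    - matchers:
        - instance =~ "viewonly-.*"
      receiver: blackhole

receivers:
  - name: ops-team
    # ... your real notification settings (email, Slack, ...) ...
  - name: blackhole   # intentionally empty: notifications are dropped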
As an example, I'll take an alert that fires when the root filesystem is
more than 85% full:

groups:
  - name: node_alerts
    rules:
      - alert: node_free_fs_space
        expr: |
          100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} * 100)
            / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"}) >= 85
With respect to (a):
I could of course add yet another label matcher like:
instance=~"someRegexThatDescribesMyInstances"
to each time series, but when that regex gets more complex, everything
becomes quite unreadable, and it's error prone: it's easy to forget to
update one place (assuming one has many alerts) when the regex changes.
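To make that duplication concrete: applying such a matcher to the example
rule above means repeating the regex in every selector of the expression
(the regex here is a placeholder):

```yaml
- alert: node_free_fs_space
  expr: |
    100 - ((node_filesystem_avail_bytes{instance=~"db-.*|web-.*",mountpoint="/",fstype!="rootfs"} * 100)
      / node_filesystem_size_bytes{instance=~"db-.*|web-.*",mountpoint="/",fstype!="rootfs"}) >= 85
```

Strictly, filtering one side of the division would be enough, since binary
operators only match series with identical label sets - but relying on that
makes the intent less obvious.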
If you want to apply a threshold to only certain filesystems, and/or to
have different thresholds per filesystem, then it's possible to put the
thresholds in their own set of static timeseries:
https://www.robustperception.io/using-time-series-as-alert-thresholds
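As a sketch of that approach (the metric name here is made up), the
thresholds become static recording rules that the alert expression joins
against:

```yaml
groups:
  - name: thresholds
    rules:
      # One static series per mountpoint: vector(85) creates a sample
      # with value 85, and the labels: block attaches the mountpoint.
      - record: fs_used_percent_threshold
        expr: vector(85)
        labels:
          mountpoint: /
  - name: node_alerts
    rules:
      - alert: node_free_fs_space
        expr: |
          100 - (node_filesystem_avail_bytes * 100 / node_filesystem_size_bytes)
            > on(mountpoint) group_left() fs_used_percent_threshold
```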
But I don't recommend this: I find such alerts brittle. It helps to
rethink exactly what you should be alerting on:
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
For the majority of cases: "alert on symptoms, rather than causes". That
is, alert when a service isn't *working* (which you always need to know
about), and in those alerts you can include potential cause-based
information (e.g. CPU load is high, RAM is full, database is down etc).
Now, there are also some things you want to know about *before* they become
a problem, like "disk is nearly full". But the trouble with static alerts
is, they are a pain to manage. Suppose you have a threshold at 85%, and
you have one server which is consistently at 86% but not growing - you know
this is the case, you have no need to grow the filesystem, so you end up
tweaking thresholds per instance.
I would suggest two alternatives:
1. Check dashboards daily. If you want automatic notifications, then don't
send the sort of alert which gets someone out of bed, but an "FYI"
notification to something like Slack or Teams.
2. Write dynamic alerts, e.g. have alerting rules which identify disk usage
which is growing rapidly and likely to fill in the next few hours or days.
- name: DiskRate10m
  interval: 1m
  rules:
    # Warn if the rate of growth over the last 10 minutes means the
    # filesystem will fill in 2 hours
    - alert: DiskFilling10m
      expr: |
        node_filesystem_avail_bytes / (node_filesystem_avail_bytes -
          (predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[10m], 7200) < 0)) * 7200
      for: 20m
      labels:
        severity: critical
      annotations:
        summary: 'Filesystem will be full in {{ $value | humanizeDuration }} at current 10m growth rate'

- name: DiskRate3h
  interval: 10m
  rules:
    # Warn if the rate of growth over the last 3 hours means the
    # filesystem will fill in 2 days
    - alert: DiskFilling3h
      expr: |
        node_filesystem_avail_bytes / (node_filesystem_avail_bytes -
          (predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[3h], 172800) < 0)) * 172800
      for: 6h
      labels:
        severity: warning
      annotations:
        summary: 'Filesystem will be full in {{ $value | humanizeDuration }} at current 3h growth rate'
Not sure whether anything can be done better via adding labels at some
stage.
As well as target labels, you can set labels in the alerting rules
themselves, for when an alert fires. That doesn't help you filter the alert
expr itself, but it can be useful when deciding how to route the
notification in alertmanager.
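For instance, a label set in the rule itself can then be matched in the
Alertmanager routing tree (the `team` label and receiver names here are
made up):

```yaml
# Prometheus rule: attach an extra label when the alert fires
- alert: node_free_fs_space
  expr: |
    100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} * 100)
      / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"}) >= 85
  labels:
    severity: warning
    team: storage        # used only for routing, not for filtering the expr

# alertmanager.yml: route on that label
route:
  receiver: default
  routes:
    - matchers:
        - team = "storage"
      receiver: storage-team
```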
Target labels are a decent way to classify machines, e.g. target labels for
"development" and "production" mean that you can easily alert or dispatch
alerts differently for those two environments. But you should beware of
changing them frequently, because every time the set of labels on a metric
changes, it becomes a new timeseries. This makes it hard to follow the
history of the metric.
If you want to do really clever stuff like classifying hosts dynamically,
then you can do it by having *separate* timeseries for those
classifications:
https://www.robustperception.io/how-to-have-labels-for-machine-roles
https://www.robustperception.io/exposing-the-software-version-to-prometheus
https://www.robustperception.io/left-joins-in-promql
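A sketch of such a join, assuming a hand-maintained metric like
machine_role{role="db"} 1 per instance (e.g. exposed via the node_exporter
textfile collector, as in the first article above):

```yaml
# Only fire the disk alert on hosts classified as "db"
- alert: node_free_fs_space_db_only
  expr: |
    (
      100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} * 100)
        / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"}) >= 85
    )
    and on(instance) machine_role{role="db"}
```

The `and on(instance)` keeps only the filesystem series whose instance also
has a machine_role series with role="db".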
Again, unless you really need it, this is arguably getting "too clever" -
and it will make the actual alerting rules more complex.
--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-users/71be9938-aec8-4f42-a25a-2253bf8ffa08n%40googlegroups.com.