Hey.
I have an alert rule like this:
groups:
- name: alerts_general
  rules:
  - alert: general_target-down
    expr: 'up == 0'
    for: 5m
which is intended to notify when a target instance (or rather, a specific
exporter on it) is down.
There are also routes in alertmanager.yml which use "higher" (longer)
group_wait and group_interval periods and distribute the resulting alerts
to the various receivers (e.g. depending on which instance is affected).
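For illustration, such a route looks roughly like this (receiver and
instance names here are made up):

- match:
    instance: somehost.example.org
  receiver: team_somehost
  group_wait: 5m
  group_interval: 4h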
By chance I've noticed that some of our instances (or the networking) seem
to be a bit unstable, and every now and then a single scrape, or a few,
fail.
Since this typically does not mean that the exporter is down (in the above
sense), I wouldn't want it to cause a notification to be sent to the people
responsible for the respective instances.
But I would want one to be sent, even if only a single scrape fails, to
the local Prometheus admin (me ^^), so that I can investigate what causes
the scrape failures.
My (working) solution for that is:
a) another alert rule like:
groups:
- name: alerts_general_single-scrapes
  interval: 15s
  rules:
  - alert: general_target-down_single-scrapes
    expr: 'up{instance!~"(?i)^.*\\.garching\\.physik\\.uni-muenchen\\.de$"} == 0'
    for: 0s
(With 15s being the smallest scrape interval used by any job.)
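For context, the tightest jobs are scraped roughly like this (job name and
target are made up):

scrape_configs:
- job_name: node
  scrape_interval: 15s
  static_configs:
  - targets: ['somehost.example.org:9100']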
And a corresponding alertmanager route like:
- match:
    alertname: general_target-down_single-scrapes
  receiver: admins_monitoring_no-resolved
  group_by: [alertname]
  group_wait: 0s
  group_interval: 1s
The group_wait: 0s and group_interval: 1s seemed necessary because,
despite the for: 0s, Alertmanager apparently checks again before actually
sending a notification... and if the alert is already resolved by then
(because e.g. only a single scrape was missed), it wouldn't send anything
(even though the alert did fire).
That works so far... that is, admins_monitoring_no-resolved gets a
notification for every single failed scrape, while all others only get one
when the target is down for at least 5m.
I even improved the above a bit by clearing the single-scrapes alert once
the long-term-down one starts firing, via something like:
expr: '( up{instance!~"(?i)^.*\\.ignored\\.hosts\\.example\\.org$"} == 0 )
  unless on (instance, job)
  ( ALERTS{alertname="general_target-down", alertstate="firing"} == 1 )'
I wondered whether this can be done better?
Ideally I'd like to get notification for general_target-down_single-scrapes
only sent, if there would be no one for general_target-down.
That is, I don't care if the notification comes in late (by the ~5m
above); it just *needs* to come, unless - of course - the target is
"really" down (i.e. when general_target-down fires), in which case no
notification should go out for general_target-down_single-scrapes.
I couldn't think of an easy way to get that. Any ideas?
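One thing I looked at, but haven't really tried, is Alertmanager's
inhibit_rules, e.g. something like:

inhibit_rules:
- source_match:
    alertname: general_target-down
  target_match:
    alertname: general_target-down_single-scrapes
  equal: ['instance', 'job']

But AFAIU inhibition only suppresses the target alert while the source
alert is actually firing, so it presumably wouldn't delay the
single-scrapes notification by the ~5m needed to know whether
general_target-down is going to fire at all.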
Thanks,
Chris.
--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-users/8af3cd3e-f3b9-4c0c-b799-ac7a420d8bb1n%40googlegroups.com.