That's tricky to get exactly right. You could try something like this
(untested):
expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
for: 5m
- min_over_time will be 0 if any single scrape failed in the past 5 minutes
- max_over_time will be 0 if all scrapes failed (which means the 'standard'
failure alert should have triggered)
Therefore, this should alert if any scrape failed over 5 minutes, unless
all scrapes failed over 5 minutes.
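Put together as a complete rule group, it would look something like this (a sketch only — the group and alert names are placeholders, untested like the rest):

```yaml
groups:
  - name: alerts_general_flaky-scrapes
    rules:
      - alert: general_target-flaky
        # Fires when at least one scrape in the window failed, unless
        # *every* scrape failed (the standard up == 0 alert covers that).
        expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
        for: 5m
```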
There is a boundary condition: if scraping fails for approximately
5 minutes, you can't be sure whether the standard failure alert would have
triggered. Hence it might need a bit of tweaking for robustness. To start
with, just widen it to 6 minutes:
expr: min_over_time(up[6m]) == 0 unless max_over_time(up[6m]) == 0
for: 6m
That is, if max_over_time(up[6m]) is zero, we're pretty sure that a standard
alert will have been triggered by then.
I'm still not quite convinced about the "for: 6m" and whether we might lose
an alert if there were a single failed scrape. Maybe this would be more
sensitive:
expr: min_over_time(up[8m]) == 0 unless max_over_time(up[6m]) == 0
for: 7m
but I think you might get some spurious alerts at the *end* of a period of
downtime.
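The window semantics can be sanity-checked with a toy model (plain Python, not PromQL; the 15s scrape interval and 5m window are illustrative assumptions):

```python
# Toy model of min_over_time(up[5m]) and max_over_time(up[5m]):
# over the scrape samples that fall inside the trailing window,
# min is 0 if ANY scrape failed, max is 0 only if ALL scrapes failed.

def flaky_alert_fires(samples):
    # expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
    # "unless" drops the left-hand result when the right-hand side matches,
    # i.e. suppress the flaky-scrape alert during a total outage.
    return min(samples) == 0 and not max(samples) == 0

# 20 scrapes at 15s intervals ~= one 5m window
one_failed = [1] * 19 + [0]   # a single failed scrape
all_failed = [0] * 20         # total outage
all_ok     = [1] * 20

print(flaky_alert_fires(one_failed))  # True  -> the admin gets notified
print(flaky_alert_fires(all_failed))  # False -> standard up == 0 alert covers it
print(flaky_alert_fires(all_ok))      # False
```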
On Tuesday, 9 May 2023 at 02:29:40 UTC+1 Christoph Anton Mitterer wrote:
> Hey.
>
> I have an alert rule like this:
>
> groups:
> - name: alerts_general
> rules:
> - alert: general_target-down
> expr: 'up == 0'
> for: 5m
>
> which is intended to notify about a target instance (respectively a
> specific exporter on that) being down.
>
> There are also routes in alertmanager.yml which have some "higher" periods
> for group_wait and group_interval and also distribute that resulting alerts
> to the various receivers (e.g. depending on the instance that is affected).
>
>
> By chance I've noticed that some of our instances (or the networking) seem
> to be a bit unstable and every now and so often, a single scrape or some
> few fail.
>
> Since this typically does not mean that the exporter is down (in the above
> sense), I wouldn't want that to cause a notification to be sent to people
> responsible for the respective instances.
> But I would want to get one sent, even if only a single scrape fails, to
> the local prometheus admin (me ^^), so that I can look further, what causes
> the scrape failures.
>
>
>
> My (working) solution for that is:
> a) another alert rule like:
> groups:
> - name: alerts_general_single-scrapes
> interval: 15s
> rules:
> - alert: general_target-down_single-scrapes
> expr:
> 'up{instance!~"(?i)^.*\\.garching\\.physik\\.uni-muenchen\\.de$"} == 0'
> for: 0s
>
> (With 15s being the smallest scrape time used by any jobs.)
>
> And a corresponding alertmanager route like:
> - match:
> alertname: general_target-down_single-scrapes
> receiver: admins_monitoring_no-resolved
> group_by: [alertname]
> group_wait: 0s
> group_interval: 1s
>
>
> The group_wait: 0s and group_interval: 1s seemed necessary, because despite
> the for: 0s, it seems that alertmanager kind of checks again before
> actually sending a notification... and when the alert is gone by then
> (because there was e.g. only one single missing scrape) it wouldn't send
> anything (despite the alert actually having fired).
>
>
> That works so far... that is, admins_monitoring_no-resolved gets a
> notification for every single failed scrape, while all others only get them
> when the target fails for at least 5m.
>
> I even improved the above a bit, by clearing the alert for single failed
> scrapes, when the one for long-term down starts firing via something like:
> expr: '( up{instance!~"(?i)^.*\\.ignored\\.hosts\\.example\\.org$"}
> == 0 ) unless on (instance,job) ( ALERTS{alertname="general_target-down",
> alertstate="firing"} == 1 )'
>
>
> I wondered whether this can be done better?
>
> Ideally I'd like to get notification for
> general_target-down_single-scrapes only sent, if there would be no one for
> general_target-down.
>
> That is, I don't care if the notification comes in late (by the above ~
> 5m), it just *needs* to come, unless - of course - the target is "really"
> down (that is when general_target-down fires), in which case no
> notification should go out for general_target-down_single-scrapes.
>
>
> I couldn't think of an easy way to get that. Any ideas?
>
>
> Thanks,
> Chris.
>
--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-users/237fda1f-89ce-419a-a54f-b9b12ea4d593n%40googlegroups.com.