That's tricky to get exactly right. You could try something like this 
(untested):

    expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
    for: 5m

- min_over_time will be 0 if any single scrape failed in the past 5 minutes
- max_over_time will be 0 if all scrapes failed (which means the 'standard' 
failure alert should have triggered)

Therefore, this should alert if any scrape failed over 5 minutes, unless 
all scrapes failed over 5 minutes.
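Wrapped up as a complete rule group, it would look something like this (an 
untested sketch; the group and alert names here are just illustrative):

    groups:
      - name: alerts_general_single-scrapes
        rules:
        - alert: general_target-down_single-scrapes
          # fires if any scrape in the window failed, unless all of them did
          expr: 'min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0'
          for: 5m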

There is a boundary condition: if scraping fails for approximately 5 minutes, 
you can't be sure whether the standard failure alert would have triggered. 
Hence it might need a bit of tweaking for robustness. To start with, just 
widen the windows to 6 minutes:

    expr: min_over_time(up[6m]) == 0 unless max_over_time(up[6m]) == 0
    for: 6m

That is, if max_over_time(up[6m]) is zero, we can be fairly sure that a 
standard alert will have been triggered by then.

I'm still not quite convinced about the "for: 6m" and whether we might lose 
an alert if there were a single failed scrape. Maybe this would be more 
sensitive:

    expr: min_over_time(up[8m]) == 0 unless max_over_time(up[6m]) == 0
    for: 7m

but I think you might get some spurious alerts at the *end* of a period of 
downtime.
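Alternatively, closer to the ALERTS-based suppression you already use below, 
you could mask the windowed single-scrape alert whenever the standard alert 
is actually firing (again untested, reusing your existing alert name):

    expr: 'min_over_time(up[5m]) == 0  unless on (instance,job)  ALERTS{alertname="general_target-down", alertstate="firing"} == 1'
    for: 5m

That sidesteps guessing the window sizes, at the cost of depending on the 
other rule's state.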

On Tuesday, 9 May 2023 at 02:29:40 UTC+1 Christoph Anton Mitterer wrote:

> Hey.
>
> I have an alert rule like this:
>
> groups:
>   - name:       alerts_general
>     rules:
>     - alert: general_target-down
>       expr: 'up == 0'
>       for:  5m
>
> which is intended to notify about a target instance (or rather a specific 
> exporter on it) being down.
>
> There are also routes in alertmanager.yml which have some "higher" periods 
> for group_wait and group_interval and also distribute that resulting alerts 
> to the various receivers (e.g. depending on the instance that is affected).
>
>
> By chance I've noticed that some of our instances (or the networking) seem 
> to be a bit unstable, and every now and then a single scrape or a few 
> scrapes fail.
>
> Since this typically does not mean that the exporter is down (in the above 
> sense), I wouldn't want that to cause a notification to be sent to the 
> people responsible for the respective instances.
> But I would want one to be sent, even if only a single scrape fails, to 
> the local prometheus admin (me ^^), so that I can investigate what causes 
> the scrape failures.
>
>
>
> My (working) solution for that is:
> a) another alert rule like:
> groups:
>   - name:     alerts_general_single-scrapes
>     interval: 15s
>     rules:
>     - alert: general_target-down_single-scrapes      
>       expr: 
> 'up{instance!~"(?i)^.*\\.garching\\.physik\\.uni-muenchen\\.de$"} == 0'
>       for:  0s
>
> (With 15s being the smallest scrape time used by any jobs.)
>
> And a corresponding alertmanager route like:
>   - match:
>       alertname: general_target-down_single-scrapes
>     receiver:       admins_monitoring_no-resolved
>     group_by:       [alertname]
>     group_wait:     0s
>     group_interval: 1s
>
>
> The group_wait: 0s and group_interval: 1s seemed necessary, because despite 
> the for: 0s, alertmanager seems to check again before actually sending a 
> notification... and when the alert is gone by then (because there was e.g. 
> only a single missed scrape) it wouldn't send anything (even though the 
> alert actually fired).
>
>
> That works so far... that is admins_monitoring_no-resolved get a 
> notification for every single failed scrape while all others only get them 
> when they fail for at least 5m.
>
> I even improved the above a bit, by clearing the alert for single failed 
> scrapes when the one for long-term down starts firing, via something like:
>       expr: '( up{instance!~"(?i)^.*\\.ignored\\.hosts\\.example\\.org$"} 
> == 0 )  unless on (instance,job)  ( ALERTS{alertname="general_target-down", 
> alertstate="firing"} == 1 )'
>
>
> I wondered whether this can be done better?
>
> Ideally I'd like to get notification for 
> general_target-down_single-scrapes only sent, if there would be no one for 
> general_target-down.
>
> That is, I don't care if the notification comes in late (by the above ~5m); 
> it just *needs* to come, unless - of course - the target is "really" down 
> (that is, when general_target-down fires), in which case no notification 
> should go out for general_target-down_single-scrapes.
>
>
> I couldn't think of an easy way to get that. Any ideas?
>
>
> Thanks,
> Chris.
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/237fda1f-89ce-419a-a54f-b9b12ea4d593n%40googlegroups.com.
