Hey Brian.
On Tuesday, May 9, 2023 at 9:55:22 AM UTC+2 Brian Candler wrote:
That's tricky to get exactly right. You could try something like this
(untested):
expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
for: 5m
- min_over_time will be 0 if any single scrape failed in the past 5 minutes
- max_over_time will be 0 if all scrapes failed (which means the 'standard'
failure alert should have triggered)
Therefore, this should alert if any scrape failed over 5 minutes, unless
all scrapes failed over 5 minutes.
Ah, that seems like a pretty smart idea.
And the for: is needed to make it actually "count": the [5m] only looks
back 5m, but right at the start of an outage max_over_time(up[5m]) would
likely still have been 1 while min_over_time(up[5m]) would already be 0,
so with e.g. for: 0s it would fire immediately.
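For concreteness, I imagine the full rule would look roughly like this
(untested; the alert name and annotation are just placeholders I made up):

```yaml
- alert: ScrapesPartiallyFailing
  expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
  for: 5m
  annotations:
    summary: "Some, but not all, scrapes of {{ $labels.instance }} failed over the last 5m"
```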
There is a boundary condition where if the scraping fails for approximately
5 minutes you're not sure if the standard failure alert would have
triggered.
You mean the above one wouldn't fire because it thinks it's the long-term
alert, while that one wouldn't fire either, because it had just resolved
by then?
Hence it might need a bit of tweaking for robustness. To start with, just
make it over 6 minutes:
expr: min_over_time(up[6m]) == 0 unless max_over_time(up[6m]) == 0
for: 6m
That is, if max_over_time[6m] is zero, we're pretty sure that a standard
alert will have been triggered by then.
That one I don't quite understand.
What if e.g. the following scenario happens (with each line giving the
state 1m after the one before):
 t:  -5 -4 -3 -2 -1  0 | for | min[6m] | max[6m] | short (for=6) | long (for=5)
up:   1  1  1  1  1  0 |  1  |    0    |    1    |    pending    |   pending
up:   1  1  1  1  0  0 |  2  |    0    |    1    |    pending    |   pending
up:   1  1  1  0  0  0 |  3  |    0    |    1    |    pending    |   pending
up:   1  1  0  0  0  0 |  4  |    0    |    1    |    pending    |   pending
up:   1  0  0  0  0  0 |  5  |    0    |    1    |    pending    |   fire
up:   0  0  0  0  0  1 |  6  |    0    |    1    |    fire       |   clear
After 5m, the long-term alert would fire; after that, the scraping would
succeed again, but AFAIU the "special" alert for the short outages would
still be true at that point and would then start to fire, despite all of
the previous 5 zeros actually having been reported already as part of a
long-down alert.
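To convince myself, I sketched that timeline in a quick simulation
(assuming a 1m scrape and evaluation interval, with [6m] covering exactly
the last 6 samples, and a per-minute "for:" counter):

```python
# Simulation of the scenario above: 5 good scrapes, 5 failed ones, then
# recovery. Assumptions (mine): 1m scrape/eval interval, [6m] = the last
# 6 samples, "for: Nm" = condition true for N consecutive evaluations.
up = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

def fires_at(cond, for_minutes):
    """First evaluation index at which an alert with the given per-eval
    condition and "for:" duration starts firing, or None."""
    pending = 0
    for t, ok in enumerate(cond):
        pending = pending + 1 if ok else 0
        if pending >= for_minutes:
            return t
    return None

def win(t, n):
    # samples visible in an [n-minute] range vector at evaluation time t
    return up[max(0, t - n + 1):t + 1]

# long alert:  expr: up == 0, for: 5m
long_cond = [up[t] == 0 for t in range(len(up))]
# short alert: expr: min_over_time(up[6m]) == 0
#                    unless max_over_time(up[6m]) == 0, for: 6m
short_cond = [min(win(t, 6)) == 0 and max(win(t, 6)) == 1
              for t in range(len(up))]

print(fires_at(long_cond, 5))   # 9  -> long alert fires after the 5th zero
print(fires_at(short_cond, 6))  # 10 -> short alert starts firing one
                                #       minute later, right after recovery
```

So the short alert indeed begins firing exactly at the sample where up is
1 again, one evaluation after the long alert has fired and cleared.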
I'm still not quite convinced about the "for: 6m" and whether we might lose
an alert if there were a single failed scrape. Maybe this would be more
sensitive:
expr: min_over_time(up[8m]) == 0 unless max_over_time(up[6m]) == 0
for: 7m
but I think you might get some spurious alerts at the *end* of a period of
downtime.
That also seems quite complex. And I guess it might have the same possible
issue as above?
The same should be the case if one would do:
expr: min_over_time(up[6m]) == 0 unless max_over_time(up[5m]) == 0
for: 6m
It may be that just 6m ago there was a "0" (from a long alert) and in the
last 5m there were only "1"s. So the short alert would fire, despite it
being unclear whether the "0" 6m ago was really just a lonely one or the
end of a long-alert period.
Actually, I think any case where the min_over_time goes further back than
the long alert's for: time would have that problem.
expr: min_over_time(up[5m]) == 0 unless max_over_time(up[6m]) == 0
for: 5m
would also be broken, IMO, because if 6m ago there was a "1", only the
min_over_time(up[5m]) == 0 would remain (and nothing would silence the
alert when needed)... and if 6m ago there was a "0", it should effectively
be the same as using [5m]?
Isn't the original problem from further above already solved by placing
both alerts in the same rule group?
https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
says:
"Recording and alerting rules exist in a rule group. Rules within a group
are run sequentially at a regular interval, with the same evaluation time."
which I guess also applies to alerting rules.
Not sure if I'm right, but I think if one places both rules in the same
group (and I think even the order shouldn't matter?), then the original:
expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
for: 5m
with 5m being the "for:" time of the long alert, should be guaranteed to
work... in the sense that if the above doesn't fire, the long alert does.
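Just to spell out what I mean by "same group" (untested sketch; the group
and alert names are placeholders I made up):

```yaml
groups:
  - name: availability
    rules:
      - alert: InstanceDown          # the "long" alert
        expr: up == 0
        for: 5m
      - alert: ShortScrapeFailures   # the "short" alert
        expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
        for: 5m
```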
Unless of course the grouping settings at the Alertmanager cause trouble...
which I don't quite understand... especially: once an alert fires, even if
just for a short time, is it guaranteed that a notification is sent?
Because, as I wrote before, that didn't seem to be the case.
Last but not least, if my assumption is true and your 1st version would
work if both alerts are in the same group... how would the evaluation
interval then matter? Would it still need to be the smallest scrape
interval (I guess so)?
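For reference, AFAIU the evaluation interval can be set per rule group,
which overrides the global evaluation_interval (untested; the 1m value is
just my assumption of the scrape interval):

```yaml
groups:
  - name: availability
    interval: 1m   # per-group evaluation interval, assuming 1m scrapes
    rules:
      # ... the two alert rules ...
```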
Thanks,
Chris.
--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.