[prometheus-users] Re: better way to get notified about (true) single scrape failures?

Brian Candler Wed, 10 May 2023 00:03:41 -0700

> Not sure if I'm right, but I think if one places both rules in the same
> group (and I think even the order shouldn't matter?), then the original:
>     expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
>     for: 5m
> with 5m being the "for:"-time of the long-alert should be guaranteed to
> work... in the sense that if the above doesn't fire... the long-alert
> does.


It depends on the exact semantics of "for". Take a simple case with a 1
minute rule evaluation interval. If you apply "for: 1m" then I guess that
means the alert expression must be true at two successive evaluations
(otherwise, "for: 1m" would have no effect).

If so, then "for: 5m" means it must be true at six successive
evaluations.
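
To make that counting concrete, here is a rough Python sketch (not Prometheus code). It assumes the alert goes pending at the first evaluation where the expression is true, and fires at the first evaluation where it has been continuously true for at least the "for" duration. The function name is made up for illustration, and the assumed semantics are exactly the thing that needs checking.

```python
def evaluations_until_firing(for_seconds, interval_seconds):
    """Count successive true evaluations needed before the alert fires.

    Assumed semantics (a guess, not taken from the Prometheus source):
    the alert goes pending at the first true evaluation and fires at the
    first evaluation where (now - pending_since) >= for_seconds.
    """
    pending_since = 0      # the first true evaluation happens at t=0
    now = 0
    evaluations = 1        # the evaluation at t=0 already counts
    while now - pending_since < for_seconds:
        now += interval_seconds
        evaluations += 1
    return evaluations


print(evaluations_until_firing(60, 60))    # for: 1m, 1m interval -> 2
print(evaluations_until_firing(300, 60))   # for: 5m, 1m interval -> 6
```

Under this assumption the count is one more than the number of intervals that fit into the "for" duration, which is where the "six" comes from.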

But up[5m] only looks at samples wholly contained within a 5 minute window,
and therefore, with a 1 minute scrape interval, will normally only look at
5 samples. (If there is jitter in the sampling time, then occasionally it
might look at 4 or 6 samples.)
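
A small Python sketch of that sample counting, assuming a 1 minute scrape interval. The timestamps are invented for illustration, and whether a sample sitting exactly on the left edge of the window is included is left as a parameter, because I have not verified which behaviour a given Prometheus version uses.

```python
def count_samples(sample_times, eval_time, window, include_left_edge=False):
    """Count samples with eval_time - window < t <= eval_time.

    With include_left_edge=True a sample exactly at eval_time - window
    is counted as well. All times are in seconds.
    """
    start = eval_time - window
    count = 0
    for t in sample_times:
        after_start = t >= start if include_left_edge else t > start
        if after_start and t <= eval_time:
            count += 1
    return count


# Regular scrapes every 60s, 20s after each evaluation (hypothetical offset).
regular = [20 + 60 * k for k in range(20)]
print(count_samples(regular, eval_time=600, window=300))        # 5

# Jittered scrapes that squeeze an extra sample into the window.
squeezed = [302, 360, 420, 480, 540, 599]
print(count_samples(squeezed, eval_time=600, window=300))       # 6

# Jittered scrapes where both edge samples fall just outside.
stretched = [299, 361, 421, 481, 541, 601]
print(count_samples(stretched, eval_time=600, window=300))      # 4

# Scrapes exactly aligned with evaluations: the edge rule decides.
aligned = [60 * k for k in range(20)]
print(count_samples(aligned, 600, 300, include_left_edge=False))  # 5
print(count_samples(aligned, 600, 300, include_left_edge=True))   # 6
```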

If what I've written above is correct (and it may well not be!), then

    expr: up == 0
    for: 5m

will fire if "up" is zero for 6 cycles, whereas

    ... unless max_over_time(up[5m]) == 0

will suppress an alert if "up" is zero for (usually) 5 cycles.
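
Here is a rough Python simulation of that mismatch. It is only a sketch under stated assumptions: 1 minute scrape and evaluation intervals, scrapes landing 30 seconds after each evaluation, a window that excludes its left edge, and the "for" semantics guessed above (pending at the first true evaluation, firing once 5 minutes have elapsed). All names and constants are made up for illustration.

```python
SCRAPE_INTERVAL = 60   # seconds
EVAL_INTERVAL = 60
WINDOW = 300           # the [5m] range
FOR = 300              # for: 5m
SCRAPE_OFFSET = 30     # scrapes land halfway between evaluations (assumption)


def simulate(failed_scrapes, total_scrapes=20):
    """Return (long_alert_fired, unless_clause_matched) for one outage."""
    # up is 0 for `failed_scrapes` consecutive scrapes starting at index 5.
    samples = [(SCRAPE_OFFSET + i * SCRAPE_INTERVAL,
                0 if 5 <= i < 5 + failed_scrapes else 1)
               for i in range(total_scrapes)]
    pending_since = None
    long_fired = False
    suppressed = False
    for t in range(WINDOW, total_scrapes * SCRAPE_INTERVAL, EVAL_INTERVAL):
        window = [v for ts, v in samples if t - WINDOW < ts <= t]
        latest = window[-1]          # instant value of `up` at time t
        # Long alert: up == 0 with for: 5m.
        if latest == 0:
            if pending_since is None:
                pending_since = t
            if t - pending_since >= FOR:
                long_fired = True
        else:
            pending_since = None
        # Right-hand side of `unless`: max_over_time(up[5m]) == 0.
        if max(window) == 0:
            suppressed = True
    return long_fired, suppressed


for n in (4, 5, 6):
    print(n, simulate(n))
# 4 (False, False)
# 5 (False, True)
# 6 (True, True)
```

With five failed scrapes the long alert never fires, yet the "unless" clause already matches at one evaluation. That is the case where neither rule may cover the outage, if the assumptions hold.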

If you want to get to the bottom of this with certainty, you can write unit
tests
<https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/>
that try out these scenarios.
