Hey Brian.
On Tuesday, May 9, 2023 at 9:55:22 AM UTC+2 Brian Candler wrote:
That's tricky to get exactly right. You could try something like this
(untested):
expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
for: 5m
- min_over_time will be 0 if any single scrape failed in the past 5 minutes
- max_over_time will be 0 if all scrapes failed (which means the 'standard'
failure alert should have triggered)
Therefore, this should alert if any scrape failed over 5 minutes, unless
all scrapes failed over 5 minutes.
Ah, that seems like a pretty smart idea.
And the for: is needed to make it actually "count": the [5m] only looks
back 5m, but right at the start of an outage max_over_time(up[5m]) would
likely still have been 1 while min_over_time(up[5m]) would already be 0,
so with e.g. for: 0s it would fire immediately.
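For concreteness, I imagine the full rule would look roughly like this
(untested; the alert name and annotation are just placeholders I made up):

```yaml
- alert: ScrapesPartiallyFailing
  expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
  for: 5m
  annotations:
    summary: "Some, but not all, scrapes of {{ $labels.instance }} failed over the last 5m"
```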
There is a boundary condition where if the scraping fails for approximately
5 minutes you're not sure if the standard failure alert would have
triggered.
You mean the above one wouldn't fire because it thinks it's the long-term
alert, while that one wouldn't fire either, because it had just resolved
by then?
Hence it might need a bit of tweaking for robustness. To start with, just
make it over 6 minutes:
expr: min_over_time(up[6m]) == 0 unless max_over_time(up[6m]) == 0
for: 6m
That is, if max_over_time[6m] is zero, we're pretty sure that a standard
alert will have been triggered by then.
That one I don't quite understand.
What if e.g. the following scenario happens (with each line giving the
state 1m after the one before):
 t:  -5 -4 -3 -2 -1  0 | for | min[6m] | max[6m] | short (for=6) | long (for=5)
up:   1  1  1  1  1  0 |  1  |    0    |    1    |    pending    |   pending
up:   1  1  1  1  0  0 |  2  |    0    |    1    |    pending    |   pending
up:   1  1  1  0  0  0 |  3  |    0    |    1    |    pending    |   pending
up:   1  1  0  0  0  0 |  4  |    0    |    1    |    pending    |   pending
up:   1  0  0  0  0  0 |  5  |    0    |    1    |    pending    |   fire
up:   0  0  0  0  0  1 |  6  |    0    |    1    |    fire       |   clear
After 5m, the long-term alert would fire; after that, the scraping would
succeed again, but AFAIU the "special" alert for the short outages would
still be true at that point and would then start to fire, despite all of
the previous 5 zeros actually having been reported already as part of a
long-down alert.
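To convince myself, I sketched that timeline in a quick simulation
(assuming a 1m scrape and evaluation interval, with [6m] covering exactly
the last 6 samples, and a per-minute "for:" counter):

```python
# Simulation of the scenario above: 5 good scrapes, 5 failed ones, then
# recovery. Assumptions (mine): 1m scrape/eval interval, [6m] = the last
# 6 samples, "for: Nm" = condition true for N consecutive evaluations.
up = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

def fires_at(cond, for_minutes):
    """First evaluation index at which an alert with the given per-eval
    condition and "for:" duration starts firing, or None."""
    pending = 0
    for t, ok in enumerate(cond):
        pending = pending + 1 if ok else 0
        if pending >= for_minutes:
            return t
    return None

def win(t, n):
    # samples visible in an [n-minute] range vector at evaluation time t
    return up[max(0, t - n + 1):t + 1]

# long alert:  expr: up == 0, for: 5m
long_cond = [up[t] == 0 for t in range(len(up))]
# short alert: expr: min_over_time(up[6m]) == 0
#                    unless max_over_time(up[6m]) == 0, for: 6m
short_cond = [min(win(t, 6)) == 0 and max(win(t, 6)) == 1
              for t in range(len(up))]

print(fires_at(long_cond, 5))   # 9  -> long alert fires after the 5th zero
print(fires_at(short_cond, 6))  # 10 -> short alert starts firing one
                                #       minute later, right after recovery
```

So the short alert indeed begins firing exactly at the sample where up is
1 again, one evaluation after the long alert has fired and cleared.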
I'm still not quite convinced about the "for: 6m" and whether we might lose
an alert if there were a single failed scrape. Maybe this would be more
sensitive:
expr: min_over_time(up[8m]) == 0 unless max_over_time(up[6m]) == 0
for: 7m
but I think you might get some spurious alerts at the *end* of a period of
downtime.
That also seems quite complex. And I guess it might have the same possible
issue as above?
The same should be the case if one would do:
expr: min_over_time(up[6m]) == 0 unless max_over_time(up[5m]) == 0
for: 6m
It may be that just 6m ago there was a "0" (from a long alert) and in the
last 5m there were only "1"s. So the short alert would fire, despite it
being unclear whether the "0" 6m ago was really just a lonely one or the
end of a long-alert period.
Actually, I think any case where the min_over_time goes further back than
the long alert's for: time would have that problem.
expr: min_over_time(up[5m]) == 0 unless max_over_time(up[6m]) == 0
for: 5m
would also be broken, IMO, because if 6m ago there was a "1", only the
min_over_time(up[5m]) == 0 would remain (and nothing would silence the
alert when needed)... and if 6m ago there was a "0", it should effectively
be the same as using [5m]?
Isn't the original problem from further above already solved by placing
both alerts in the same rule group?
https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
says:
"Recording and alerting rules exist in a rule group. Rules within a group
are run sequentially at a regular interval, with the same evaluation time."
which I guess also applies to alerting rules.
Not sure if I'm right, but I think if one places both rules in the same
group (and I think even the order shouldn't matter?), then the original:
expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
for: 5m
with 5m being the "for:" time of the long alert, should be guaranteed to
work... in the sense that if the above doesn't fire, the long alert does.
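Just to spell out what I mean by "same group" (untested sketch; the group
and alert names are placeholders I made up):

```yaml
groups:
  - name: availability
    rules:
      - alert: InstanceDown          # the "long" alert
        expr: up == 0
        for: 5m
      - alert: ShortScrapeFailures   # the "short" alert
        expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
        for: 5m
```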
Unless of course the grouping settings at the Alertmanager cause trouble...
which I don't quite understand... especially: once an alert fires, even if
just for a short time, is it guaranteed that a notification is sent?
Because, as I wrote before, that didn't seem to be the case.
Last but not least, if my assumption is true and your 1st version would
work if both alerts are in the same group... how would the evaluation
interval then matter? Would it still need to be the smallest scrape
interval (I guess so)?
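For reference, AFAIU the evaluation interval can be set per rule group,
which overrides the global evaluation_interval (untested; the 1m value is
just my assumption of the scrape interval):

```yaml
groups:
  - name: availability
    interval: 1m   # per-group evaluation interval, assuming 1m scrapes
    rules:
      # ... the two alert rules ...
```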
Thanks,
Chris.
--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.