I suspect the easiest way to debug this is to focus on "*repeat_interval:
2m*". Even if a single alert is firing steadily, you should get the same
notification resent every 2 minutes. So don't worry about catching second
instances of the same expr for now; just set a simple alerting expression
which fires continuously, say just "expr: vector(0)", to find out why the
notification isn't being resent.
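For example, a minimal rule file for such a test might look something like
this (the alert name, label and annotation are just placeholders):

groups:
  - name: Test alerts
    rules:
      - alert: AlwaysFiring        # placeholder name, for testing only
        expr: vector(0)            # always returns one sample, so the alert always fires
        labels:
          severity: test
        annotations:
          summary: "Canary alert to verify repeat_interval behaviour"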
You can then look at logs from alertmanager (e.g. "journalctl -eu
alertmanager" if running under systemd). You can also look at the metrics
alertmanager itself generates:
curl localhost:9093/metrics | grep alertmanager
Hopefully one of these will give you a clue as to what's happening (e.g.
maybe your mail system or other notification endpoint is doing some sort of
rate limiting?).
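In particular, the notification counters should tell you whether
notifications are being attempted at all and whether they are failing (these
metric names are from recent Alertmanager releases; adjust the grep if yours
differ):

curl -s localhost:9093/metrics | grep alertmanager_notifications

alertmanager_notifications_total counts attempted notifications per
integration, and alertmanager_notifications_failed_total counts the ones
that failed. If the former isn't increasing roughly every repeat_interval
while the alert is firing, notifications are never being attempted; if the
latter is increasing, your receiver is rejecting them.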
However, if the vector(0) expression *does* send repeated alerts
successfully, then your problem is most likely something to do with your
actual alerting expr, and you'll need to break it down into simpler pieces
to debug it.
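For instance, you could graph each half of your "or" separately in the query
browser and check which one (if either) is still returning samples at the
times you expect a repeat notification:

sum(error_counter{service="myservice",other="labels"} unless
  error_counter{service="myservice",other="labels"} offset 1m) > 0

sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0

Bear in mind that repeat_interval only resends while the alert is still
firing, i.e. while the overall expression keeps returning at least one
sample at every evaluation.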
Apart from that, all I can say is "it works for me™": if an alerting
expression subsequently generates a second alert in its result vector, then
I get another alert after group_interval.
On Monday, 27 June 2022 at 09:39:45 UTC+1 [email protected] wrote:
> Hi Brian,
>
> Thanks for your reply! To be honest, you can pretty much ignore the first
> part of the expression; it doesn't change anything in the "repeat"
> behaviour. In fact, we don't even have that bit at the moment. It's just
> something I've been playing with in order to capture that very first
> springing into existence of the metric, which isn't covered by the current
> expression,
> *sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0*.
> Also, I've already done the PromQL graphing that you suggested; I could
> see those multiple lines you were talking about, but then there was no
> alert firing... 🤷‍♂️
>
> Any other pointers?
>
> Thanks,
> Ionel
>
> On Saturday, 25 June 2022 at 16:52:17 UTC+1 Brian Candler wrote:
>
>> Try putting the whole alerting "expr" into the PromQL query browser, and
>> switching to graph view.
>>
>> This will show you the alert vector graphically, with a separate line for
>> each alert instance. If this isn't showing multiple lines, then you won't
>> receive multiple alerts. Then you can break your query down into parts and
>> try them individually, to understand why it's not working as you expect.
>>
>> Looking at just part of your expression:
>>
>> *sum(error_counter{service="myservice",other="labels"} unless
>> error_counter{service="myservice",other="labels"} offset 1m) > 0*
>>
>> And taking just the part inside sum():
>>
>> *error_counter{service="myservice",other="labels"} unless
>> error_counter{service="myservice",other="labels"} offset 1m*
>>
>> This expression is weird. It will only generate a value when the error
>> counter first springs into existence. As soon as it has existed for more
>> than 1 minute - even with value zero - then the "unless" clause will
>> suppress the expression completely, i.e. it will be an empty instance
>> vector.
>>
>> I think this is probably not what you want. In any case it's not a good
>> idea to have timeseries which come and go; it's very awkward to alert on a
>> timeseries appearing or disappearing, and you may have problems with
>> staleness, i.e. the timeseries may continue to exist for 5 minutes after
>> you've stopped generating points in it.
>>
>> It's much better to have a timeseries which continues to exist. That is,
>> "error_counter" should spring into existence with value 0, and increment
>> when errors occur, and stop incrementing when errors don't occur - but
>> continue to keep the value it had before.
>>
>> If your error_counter timeseries *does* exist continuously, then this
>> 'unless' clause is probably not what you want.
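>>
>> For instance, the second half of your existing rule on its own (same
>> labels as before):
>>
>> *sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0*
>>
>> will return a sample - and therefore keep the alert firing - whenever the
>> counter has increased within the last minute, without any of the
>> appearing/disappearing behaviour described above.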
>>
>> On Saturday, 25 June 2022 at 15:42:08 UTC+1 [email protected] wrote:
>>
>>> Hello,
>>>
>>> I'm trying to set up some alerts that fire on critical errors, so I'm
>>> aiming for immediate & consistent reporting for as much as possible.
>>>
>>> So for that matter, I defined the alert rule without a *for* clause:
>>>
>>> *groups:
>>> - name: Test alerts
>>>   rules:
>>>   - alert: MyService Test Alert
>>>     expr: 'sum(error_counter{service="myservice",other="labels"} unless
>>>       error_counter{service="myservice",other="labels"} offset 1m) > 0 or
>>>       sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0'*
>>>
>>> Prometheus is configured to scrape & evaluate at 10 s:
>>>
>>> *global:
>>>   scrape_interval: 10s
>>>   scrape_timeout: 10s
>>>   evaluation_interval: 10s*
>>>
>>> And the alert manager (docker image
>>> *quay.io/prometheus/alertmanager:v0.23.0*) is configured with these
>>> parameters:
>>>
>>> *route:
>>>   group_by: ['alertname', 'node_name']
>>>   group_wait: 30s
>>>   group_interval: 1m    # used to be 5m
>>>   repeat_interval: 2m   # used to be 3h*
>>>
>>> Now what happens when testing is this:
>>> - on the very first metric generated, the alert fires as expected;
>>> - on subsequent tests it stops firing;
>>> - *I kept on running a new test each minute for 20 minutes, but no
>>> alert fired again*;
>>> - I can see the alert state going into *FIRING* in the alerts view in
>>> the Prometheus UI;
>>> - I can see the metric values getting generated when executing the
>>> expression query in the Prometheus UI;
>>>
>>> I redid the same test suite after a 2-hour break and exactly the same thing
>>> happened, including the fact that *the alert fired on the first test!*
>>>
>>> What am I missing here? How can I make the alert manager fire that alert
>>> on repeated error metric hits? OK, it doesn't have to be as often as every
>>> 2m, but let's go with that for testing's sake.
>>>
>>> Pretty please, any advice is much appreciated!
>>>
>>> Kind regards,
>>> Ionel
>>>
>>