Hi Brian,

Thanks for your reply! To be honest, you can pretty much ignore the first 
part of the expression; it doesn't change anything in the "repeat" 
behaviour. In fact, we don't even have that bit at the moment. It's just 
something I've been playing with in order to capture the very first 
springing into existence of the metric, which isn't covered by the current 
expression, sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0.
Also, I've already tried the PromQL graphing that you suggested. I could 
see the multiple lines you were talking about, but no alert fired... 🤷‍♂️

Any other pointers?

Thanks,
Ionel

On Saturday, 25 June 2022 at 16:52:17 UTC+1 Brian Candler wrote:

> Try putting the whole alerting "expr" into the PromQL query browser, and 
> switching to graph view.
>
> This will show you the alert vector graphically, with a separate line for 
> each alert instance.  If this isn't showing multiple lines, then you won't 
> receive multiple alerts.  Then you can break down your query into parts, 
> try them individually, to try to understand why it's not working as you 
> expect.
>
> Looking at just part of your expression:
>
> sum(error_counter{service="myservice",other="labels"} unless 
> error_counter{service="myservice",other="labels"} offset 1m) > 0
>
> And taking just the part inside sum():
>
> error_counter{service="myservice",other="labels"} unless 
> error_counter{service="myservice",other="labels"} offset 1m
>
> This expression is weird. It will only generate a value when the error 
> counter first springs into existence.  As soon as it has existed for more 
> than 1 minute - even with value zero - then the "unless" clause will 
> suppress the expression completely, i.e. it will be an empty instance 
> vector.
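> Concretely (the values below are made up, purely for illustration):

```
# Suppose that at evaluation time the series has existed for over a
# minute, so both sides of the "unless" return a sample:
#   error_counter{service="myservice",other="labels"}            => 7
#   error_counter{service="myservice",other="labels"} offset 1m  => 7
# "A unless B" keeps only the samples of A whose label set has no match
# in B.  Both sides match here, so the result is the empty instance
# vector - and sum() over an empty vector yields no value at all (not
# 0), so the "> 0" comparison on that half can never fire.
error_counter{service="myservice",other="labels"}
  unless error_counter{service="myservice",other="labels"} offset 1m
```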
>
> I think this is probably not what you want.  In any case it's not a good 
> idea to have timeseries which come and go; it's very awkward to alert on a 
> timeseries appearing or disappearing, and you may have problems with 
> staleness, i.e. the timeseries may continue to exist for 5 minutes after 
> you've stopped generating points in it.
>
> It's much better to have a timeseries which continues to exist.  That is, 
> "error_counter" should spring into existence with value 0, and increment 
> when errors occur, and stop incrementing when errors don't occur - but 
> continue to keep the value it had before.
>
> If your error_counter timeseries *does* exist continuously, then this 
> 'unless' clause is probably not what you want.
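> A minimal sketch of that pattern in Python, assuming the official 
> prometheus_client library (the metric and label names below are 
> illustrative, not taken from your setup):

```python
# Sketch: register the counter at process startup so the timeseries
# exists continuously with value 0, instead of springing into existence
# on the first error.  Assumes the official prometheus_client library.
from prometheus_client import Counter, REGISTRY

error_counter = Counter(
    "myservice_errors", "Errors seen by myservice", ["other"],
)

# Pre-create the labelled child so the labelled series is also exposed
# with value 0 from the very first scrape.
error_counter.labels(other="labels")

def record_error():
    # Increment only when an error actually happens; between errors the
    # series simply keeps its last value - it never disappears.
    error_counter.labels(other="labels").inc()

# The sample is exposed with value 0 before any error has occurred:
print(REGISTRY.get_sample_value("myservice_errors_total",
                                {"other": "labels"}))  # -> 0.0
```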
>
> On Saturday, 25 June 2022 at 15:42:08 UTC+1 [email protected] wrote:
>
>> Hello,
>>
>> I'm trying to set up some alerts that fire on critical errors, so I'm 
>> aiming for immediate and consistent reporting as much as possible.
>>
>> So for that matter, I defined the alert rule without a *for* clause:
>>
>> groups:
>> - name: Test alerts
>>   rules:
>>   - alert: MyService Test Alert
>>     expr: 'sum(error_counter{service="myservice",other="labels"} unless
>>       error_counter{service="myservice",other="labels"} offset 1m) > 0
>>       or sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0'
>>
>> Prometheus is configured to scrape & evaluate at 10 s:
>>
>> global:
>>   scrape_interval: 10s
>>   scrape_timeout: 10s
>>   evaluation_interval: 10s
>>
>> And the Alertmanager (docker image quay.io/prometheus/alertmanager:v0.23.0) 
>> is configured with these parameters:
>>
>> route:
>>   group_by: ['alertname', 'node_name']
>>   group_wait: 30s
>>   group_interval: 1m # used to be 5m
>>   repeat_interval: 2m # used to be 3h
>>
>> Now what happens when testing is this:
>> - on the very first metric generated, the alert fires as expected;
>> - on subsequent tests it stops firing;
>> - I kept running a new test each minute for 20 minutes, but no alert 
>> fired again;
>> - I can see the alert state going to FIRING in the alerts view in the 
>> Prometheus UI;
>> - I can see the metric values being generated when executing the 
>> expression query in the Prometheus UI.
>>
>> I redid the same test suite after a two-hour break, and exactly the same 
>> thing happened, including the fact that the alert fired on the first test!
>>
>> What am I missing here? How can I make Alertmanager fire that alert on 
>> repeated error metric hits? It doesn't have to be as often as every 2m, 
>> but let's go with that for testing's sake.
>>
>> Pretty please, any advice is much appreciated!
>>
>> Kind regards,
>> Ionel
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/cd7ad41a-df6f-40bc-9500-6c11fa1a93ben%40googlegroups.com.
