Look at container logs then.
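
For example, something like this (I'm assuming the container is just named 
"alertmanager" - substitute whatever your setup calls it):

    docker logs --since 1h alertmanager

If that stays quiet, starting alertmanager with --log.level=debug makes it 
a lot chattier about what it's doing.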

Metrics include things like the number of notifications attempted, 
succeeded and failed.  Those would be the obvious first place to look.  
(For example: is it actually trying to send a mail? If so, is it succeeding 
or failing?)
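
For instance (the exact metric names below are from memory, so check what 
your version actually exposes):

    curl -s localhost:9093/metrics | grep alertmanager_notifications

That should show alertmanager_notifications_total and 
alertmanager_notifications_failed_total, broken down per integration 
(email, webhook, etc.).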

Aside: vector(0) and vector(1) are the same for generating alerts. It's 
only the presence of a value that triggers an alert; the actual value 
itself can be anything.
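
For a test, a minimal always-firing rule could look like this (the group 
and alert names here are just placeholders):

    groups:
    - name: Debug
      rules:
      - alert: AlwaysFiring
        expr: vector(1)
        labels:
          severity: test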

On Monday, 27 June 2022 at 16:28:46 UTC+1 [email protected] wrote:

> OK, I added a rule with an expression of *vector(1)*. It went live at 12:31, 
> when it fired 2 alerts (?!), but then went completely silent until 15:36, 
> when it fired again 2x (so more than 3 h in). The alert has been stuck in 
> the *FIRING* state the whole time, as expected.
> Unfortunately the logs don't shed any light - there's nothing logged aside 
> from the bootstrap logs. It isn't a systemd process - it runs in a 
> container, and there seems to be just a single big executable in there.
> The meta-metrics contain quite a lot of data - any particulars I should be 
> looking for?
>
> Either way, I'm now inclined to believe that this is definitely an 
> *alertmanager* settings matter. As I mentioned in my initial email, I've 
> already tweaked *group_wait*, *group_interval* & *repeat_interval*, but 
> they probably didn't take effect the way I thought they would. So maybe 
> that's something I need to sort out. Better logging should help me 
> understand all of that, and I still need to figure out how to enable it.
>
> Thank you very much for your help Brian!
>
> On Monday, 27 June 2022 at 09:59:59 UTC+1 Brian Candler wrote:
>
>> I suspect the easiest way to debug this is to focus on "*repeat_interval: 
>> 2m*".  Even if a single alert is statically firing, you should get the 
>> same notification resent every 2 minutes.  So don't worry about catching 
>> second instances of the same expr; just set a simple alerting expression 
>> which fires continuously, say just "expr: vector(0)", to find out why it's 
>> not resending.
>>
>> You can then look at logs from alertmanager (e.g. "journalctl -eu 
>> alertmanager" if running under systemd). You can also look at the metrics 
>> alertmanager itself generates:
>>
>>     curl localhost:9093/metrics | grep alertmanager
>>
>> Hopefully, one of these may give you a clue as to what's happening (e.g. 
>> maybe your mail system or other notification endpoint has some sort of rate 
>> limiting??).
>>
>> However, if the vector(0) expression *does* send repeated alerts 
>> successfully, then your problem is most likely something to do with your 
>> actual alerting expr, and you'll need to break it down into simpler pieces 
>> to debug it.
>>
>> Apart from that, all I can say is "it works for me™": if an alerting 
>> expression subsequently generates a second alert in its result vector, then 
>> I get another alert after group_interval.
>>
>> On Monday, 27 June 2022 at 09:39:45 UTC+1 [email protected] wrote:
>>
>>> Hi Brian,
>>>
>>> Thanks for your reply! To be honest, you can pretty much ignore that 
>>> first part of the expression; it doesn't change anything in the "repeat" 
>>> behaviour. In fact, we don't even have that bit at the moment - it's just 
>>> something I've been playing with in order to capture the very first 
>>> springing into existence of the metric, which isn't covered by the 
>>> current expression, 
>>> *sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0*.
>>> Also, I've already done the PromQL graphing that you suggested; I could 
>>> see those multiple lines you were talking about, but then there was no 
>>> alert firing... 🤷‍♂️
>>>
>>> Any other pointers?
>>>
>>> Thanks,
>>> Ionel
>>>
>>> On Saturday, 25 June 2022 at 16:52:17 UTC+1 Brian Candler wrote:
>>>
>>>> Try putting the whole alerting "expr" into the PromQL query browser, 
>>>> and switching to graph view.
>>>>
>>>> This will show you the alert vector graphically, with a separate line 
>>>> for each alert instance.  If this isn't showing multiple lines, then you 
>>>> won't receive multiple alerts.  Then you can break your query down into 
>>>> parts and try them individually, to understand why it's not working as 
>>>> you expect.
>>>>
>>>> Looking at just part of your expression:
>>>>
>>>> *sum(error_counter{service="myservice",other="labels"} unless 
>>>> error_counter{service="myservice",other="labels"} offset 1m) > 0*
>>>>
>>>> And taking just the part inside sum():
>>>>
>>>> *error_counter{service="myservice",other="labels"} unless 
>>>> error_counter{service="myservice",other="labels"} offset 1m*
>>>>
>>>> This expression is weird. It will only generate a value when the error 
>>>> counter first springs into existence.  As soon as it has existed for more 
>>>> than 1 minute - even with value zero - then the "unless" clause will 
>>>> suppress the expression completely, i.e. it will be an empty instance 
>>>> vector.
>>>>
>>>> I think this is probably not what you want.  In any case it's not a 
>>>> good idea to have timeseries which come and go; it's very awkward to alert 
>>>> on a timeseries appearing or disappearing, and you may have problems with 
>>>> staleness, i.e. the timeseries may continue to exist for 5 minutes after 
>>>> you've stopped generating points in it.
>>>>
>>>> It's much better to have a timeseries which continues to exist.  That 
>>>> is, "error_counter" should spring into existence with value 0, and 
>>>> increment when errors occur, and stop incrementing when errors don't occur 
>>>> - but continue to keep the value it had before.
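>>>>
>>>> As an illustration only - I don't know which client library your service 
>>>> uses, so this is just a hypothetical sketch with the official Go client - 
>>>> you can pre-create the labelled child so the series is exported with 
>>>> value 0 from the very first scrape:
>>>>
>>>>     package main
>>>>
>>>>     import (
>>>>         "log"
>>>>         "net/http"
>>>>
>>>>         "github.com/prometheus/client_golang/prometheus"
>>>>         "github.com/prometheus/client_golang/prometheus/promauto"
>>>>         "github.com/prometheus/client_golang/prometheus/promhttp"
>>>>     )
>>>>
>>>>     // errorCounter exists for the lifetime of the process and only ever
>>>>     // increments; it never disappears between errors.
>>>>     var errorCounter = promauto.NewCounterVec(prometheus.CounterOpts{
>>>>         Name: "error_counter",
>>>>         Help: "Total number of errors seen by the service.",
>>>>     }, []string{"service", "other"})
>>>>
>>>>     func main() {
>>>>         // Touching the child with its label values creates it at 0 right
>>>>         // away, instead of letting it appear on the first error.
>>>>         errorCounter.WithLabelValues("myservice", "labels")
>>>>
>>>>         // Port is arbitrary for this sketch.
>>>>         http.Handle("/metrics", promhttp.Handler())
>>>>         log.Fatal(http.ListenAndServe(":8080", nil))
>>>>     }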
>>>>
>>>> If your error_counter timeseries *does* exist continuously, then this 
>>>> 'unless' clause is probably not what you want.
>>>>
>>>> On Saturday, 25 June 2022 at 15:42:08 UTC+1 [email protected] 
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I'm trying to set up some alerts that fire on critical errors, so I'm 
>>>>> aiming for immediate & consistent reporting as much as possible.
>>>>>
>>>>> To that end, I defined the alert rule without a *for* clause:
>>>>>
>>>>>     groups:
>>>>>     - name: Test alerts
>>>>>       rules:
>>>>>       - alert: MyService Test Alert
>>>>>         expr: 'sum(error_counter{service="myservice",other="labels"} unless
>>>>>                error_counter{service="myservice",other="labels"} offset 1m) > 0
>>>>>                or sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0'
>>>>>
>>>>> Prometheus is configured to scrape & evaluate at 10 s:
>>>>>
>>>>>     global:
>>>>>       scrape_interval: 10s
>>>>>       scrape_timeout: 10s
>>>>>       evaluation_interval: 10s
>>>>>
>>>>> And the alert manager (docker image 
>>>>> *quay.io/prometheus/alertmanager:v0.23.0*) is configured with these 
>>>>> parameters:
>>>>>
>>>>>     route:
>>>>>       group_by: ['alertname', 'node_name']
>>>>>       group_wait: 30s
>>>>>       group_interval: 1m  # used to be 5m
>>>>>       repeat_interval: 2m # used to be 3h
>>>>>
>>>>> Now what happens when testing is this:
>>>>> - on the very first metric generated, the alert fires as expected;
>>>>> - on subsequent tests it stops firing;
>>>>> - *I kept running a new test each minute for 20 minutes, but no alert 
>>>>> fired again*;
>>>>> - I can see the alert state going into *FIRING* in the alerts view in 
>>>>> the Prometheus UI;
>>>>> - I can see the metric values getting generated when executing the 
>>>>> expression query in the Prometheus UI.
>>>>>
>>>>> I redid the same test suite after a 2-hour break & exactly the same 
>>>>> thing happened, including the fact that *the alert fired on the first 
>>>>> test!*
>>>>>
>>>>> What am I missing here? How can I make the alert manager fire that 
>>>>> alert on repeated error metric hits? OK, it doesn't have to be as often 
>>>>> as every 2m, but let's consider that for testing's sake.
>>>>>
>>>>> Pretty please, any advice is much appreciated!
>>>>>
>>>>> Kind regards,
>>>>> Ionel
>>>>>
>>>>
