Ok, I added a rule with an expression of *vector(1)*. It went live at 12:31, when it fired 2 alerts (?!), but then went completely silent until 15:36, when it fired again twice (so more than 3 h in). The alert has been stuck in the *FIRING* state the whole time, as expected. Unfortunately the logs don't shed any light - nothing is logged apart from the bootstrap messages. It isn't a systemd process - it runs in a container, and there seems to be just one big executable in there. The meta-metrics contain quite a lot of data - any particulars I should be looking for?

Either way, I'm now inclined to believe that this is definitely an *alertmanager* configuration matter. As I mentioned in my initial email, I've already tweaked *group_wait*, *group_interval* & *repeat_interval*, but they probably didn't take effect the way I thought they would. So maybe that's something I need to sort out. Better logging should also help me understand all of this, but I still need to figure out how to enable it.

Thank you very much for your help, Brian!
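As an aside on the meta-metrics question above: the self-metrics that are usually most telling for "notifications stopped" problems are the notification counters and the active-alert gauge. The names below are as exposed by recent Alertmanager versions - worth confirming against your own /metrics output:

    curl -s localhost:9093/metrics | grep -E 'alertmanager_(alerts|notifications)'

    # alertmanager_alerts{state="active"}                        - alerts Alertmanager currently holds
    # alertmanager_notifications_total{integration="..."}        - notification attempts, per receiver integration
    # alertmanager_notifications_failed_total{integration="..."} - attempts that failed, per receiver integration

If alertmanager_notifications_total is not increasing at all while the alert stays active, the routing/grouping configuration is the place to look; if it increases but alertmanager_notifications_failed_total climbs with it, the receiver endpoint (mail server, webhook, etc.) is the more likely culprit.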
On Monday, 27 June 2022 at 09:59:59 UTC+1 Brian Candler wrote:

> I suspect the easiest way to debug this is to focus on "*repeat_interval: 2m*". Even if a single alert is statically firing, you should get the same notification resent every 2 minutes. So don't worry about catching second instances of the same expr; just set a simple alerting expression which fires continuously, say just "expr: vector(0)", to find out why it's not resending.
>
> You can then look at logs from alertmanager (e.g. "journalctl -eu alertmanager" if running under systemd). You can also look at the metrics alertmanager itself generates:
>
>     curl localhost:9093/metrics | grep alertmanager
>
> Hopefully, one of these may give you a clue as to what's happening (e.g. maybe your mail system or other notification endpoint has some sort of rate limiting??).
>
> However, if the vector(0) expression *does* send repeated alerts successfully, then your problem is most likely something to do with your actual alerting expr, and you'll need to break it down into simpler pieces to debug it.
>
> Apart from that, all I can say is "it works for me™": if an alerting expression subsequently generates a second alert in its result vector, then I get another alert after group_interval.
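A minimal rule file for the debugging approach Brian describes might look something like this (the group and alert names here are just placeholders):

    groups:
    - name: alerting-debug
      rules:
      - alert: AlwaysFiring
        expr: vector(0)
        labels:
          severity: test
        annotations:
          summary: Debug alert that never resolves; it should be re-notified every repeat_interval

Because vector(0) always returns exactly one sample, this alert fires immediately and never resolves, which isolates the Alertmanager side of the pipeline: with the settings from the original config (group_wait: 30s, repeat_interval: 2m), one notification should arrive shortly after the rule loads and then roughly every two minutes after that.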
> On Monday, 27 June 2022 at 09:39:45 UTC+1 [email protected] wrote:
>
>> Hi Brian,
>>
>> Thanks for your reply! To be honest, you can pretty much ignore that first part of the expression; it doesn't change anything in the "repeat" behaviour. In fact, we don't even have that bit at the moment - it's just something I've been playing with in order to capture the very first springing into existence of the metric, which isn't covered by the current expression, *sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0*.
>>
>> Also, I've already done the PromQL graphing that you suggested. I could see those multiple lines you were talking about, but then there was no alert firing... 🤷‍♂️
>>
>> Any other pointers?
>>
>> Thanks,
>> Ionel
>>
>> On Saturday, 25 June 2022 at 16:52:17 UTC+1 Brian Candler wrote:
>>
>>> Try putting the whole alerting "expr" into the PromQL query browser, and switching to graph view.
>>>
>>> This will show you the alert vector graphically, with a separate line for each alert instance. If this isn't showing multiple lines, then you won't receive multiple alerts. Then you can break your query down into parts and try them individually, to understand why it's not working as you expect.
>>>
>>> Looking at just part of your expression:
>>>
>>>     sum(error_counter{service="myservice",other="labels"} unless error_counter{service="myservice",other="labels"} offset 1m) > 0
>>>
>>> And taking just the part inside sum():
>>>
>>>     error_counter{service="myservice",other="labels"} unless error_counter{service="myservice",other="labels"} offset 1m
>>>
>>> This expression is weird. It will only generate a value when the error counter first springs into existence. As soon as it has existed for more than 1 minute - even with value zero - the "unless" clause will suppress the expression completely, i.e. it will be an empty instance vector.
>>>
>>> I think this is probably not what you want. In any case, it's not a good idea to have timeseries which come and go; it's very awkward to alert on a timeseries appearing or disappearing, and you may have problems with staleness, i.e. the timeseries may continue to exist for 5 minutes after you've stopped generating points in it.
>>>
>>> It's much better to have a timeseries which continues to exist. That is, "error_counter" should spring into existence with value 0, increment when errors occur, and stop incrementing when errors don't occur - but keep the value it had before.
>>>
>>> If your error_counter timeseries *does* exist continuously, then this 'unless' clause is probably not what you want.
>>>
>>> On Saturday, 25 June 2022 at 15:42:08 UTC+1 [email protected] wrote:
>>>
>>>> Hello,
>>>>
>>>> I'm trying to set up some alerts that fire on critical errors, so I'm aiming for reporting that is as immediate & consistent as possible.
>>>>
>>>> For that reason, I defined the alert rule without a *for* clause:
>>>>
>>>>     groups:
>>>>     - name: Test alerts
>>>>       rules:
>>>>       - alert: MyService Test Alert
>>>>         expr: 'sum(error_counter{service="myservice",other="labels"} unless error_counter{service="myservice",other="labels"} offset 1m) > 0 or sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0'
>>>>
>>>> Prometheus is configured to scrape & evaluate every 10 s:
>>>>
>>>>     global:
>>>>       scrape_interval: 10s
>>>>       scrape_timeout: 10s
>>>>       evaluation_interval: 10s
>>>>
>>>> And the alert manager (docker image *quay.io/prometheus/alertmanager:v0.23.0*) is configured with these parameters:
>>>>
>>>>     route:
>>>>       group_by: ['alertname', 'node_name']
>>>>       group_wait: 30s
>>>>       group_interval: 1m  # used to be 5m
>>>>       repeat_interval: 2m # used to be 3h
>>>>
>>>> Now what happens when testing is this:
>>>> - on the very first metric generated, the alert fires as expected;
>>>> - on subsequent tests it stops firing;
>>>> - *I kept on running a new test each minute for 20 minutes, but no alert fired again*;
>>>> - I can see the alert state going into *FIRING* in the alerts view in the Prometheus UI;
>>>> - I can see the metric values getting generated when executing the expression query in the Prometheus UI.
>>>>
>>>> I redid the same test suite after a 2 hour break & exactly the same thing happened, including the fact that *the alert fired on the first test!*
>>>>
>>>> What am I missing here? How can I make the alert manager fire that alert on repeated error metric hits? It doesn't have to be as soon as every 2m, but let's consider that for testing's sake.
>>>>
>>>> Pretty please, any advice is much appreciated!
>>>>
>>>> Kind regards,
>>>> Ionel
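Finally, on the point earlier in the thread that the tweaked group_wait / group_interval / repeat_interval values may not have taken effect: a running Alertmanager only picks up configuration changes after a reload (a SIGHUP to the process, or a POST to its /-/reload endpoint), so a stale config is worth ruling out first. A minimal sketch of the Alertmanager side of the repeat_interval test, reusing the values from the original config (the receiver name and webhook URL are placeholders - any HTTP sink that logs requests will do for watching resends):

    route:
      group_by: ['alertname', 'node_name']
      group_wait: 30s
      group_interval: 1m
      repeat_interval: 2m
      receiver: debug-webhook                     # the root route must name a default receiver

    receivers:
    - name: debug-webhook                         # placeholder receiver
      webhook_configs:
      - url: 'http://alert-sink.internal:8080/'   # placeholder endpoint for observing notification timing

Starting Alertmanager with --log.level=debug should also get it to log each notification attempt, which would cover the "better logging" part mentioned above.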

