Interesting. It seems the alertmanagers are spread over 3 different
regions (2x Asia, 2x USA, 4x Europe). Maybe there is a latency problem
between them, such as delayed gossip messages?
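
If cross-region latency is the culprit, the cluster timing flags can be
relaxed; they exist precisely for high-latency deployments. A minimal
sketch (the flags are real Alertmanager flags, but the values are
illustrative only and would need tuning):

# defaults: peer-timeout=15s, gossip-interval=200ms, pushpull-interval=1m.
# Larger values give a slow WAN more time to propagate the notification
# log before the next peer decides to send the notification itself.
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager2:9094 \
  --cluster.peer-timeout=45s \
  --cluster.gossip-interval=600ms \
  --cluster.pushpull-interval=3m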

On Friday, January 13, 2023 at 3:28:57 PM UTC+1 Brian Candler wrote:

> That's a lot of alertmanagers. Are they all fully meshed? (But I'd say 2 
> or 3 would be better, spread over different regions.)
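>
> For reference, a full mesh just means every instance lists every other
> instance as a peer. A sketch for one node (hostnames taken from your
> config; note the cluster port defaults to 9094, not the 9093 API port):
>
> # on alertmanager1 -- repeat on each node, listing the other seven peers
> alertmanager \
>   --cluster.listen-address=0.0.0.0:9094 \
>   --cluster.peer=alertmanager2:9094 \
>   --cluster.peer=alertmanager3:9094
> # ...continue with one --cluster.peer flag each for alertmanager4 through 8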
>
> On Friday, 13 January 2023 at 14:16:27 UTC LukaszSz wrote:
>
>> Yes. The Prometheus server is configured to communicate with all 
>> alertmanagers (sorry, there are 8 alertmanagers):
>>
>> alerting:
>>   alert_relabel_configs:
>>   - action: labeldrop
>>     regex: "^prometheus_server$"
>>   alertmanagers:
>>   - static_configs:
>>     - targets:
>>       - alertmanager1:9093
>>       - alertmanager2:9093
>>       - alertmanager3:9093
>>       - alertmanager4:9093
>>       - alertmanager5:9093
>>       - alertmanager6:9093
>>       - alertmanager7:9093
>>       - alertmanager8:9093 
>>
>> On Friday, January 13, 2023 at 2:02:14 PM UTC+1 Brian Candler wrote:
>>
>>> Yes, but have you configured the Prometheus (the one which has the 
>>> alerting rules) to have all four alertmanagers as its destination?
>>>
>>> On Friday, 13 January 2023 at 12:55:49 UTC LukaszSz wrote:
>>>
>>>> Yes, Brian. As I mentioned in my post, the Alertmanagers are in a 
>>>> cluster and this event is visible on all 4 of my alertmanagers.
>>>> The problem I described is that alerts fire twice, which generates 
>>>> duplicates.
>>>>
>>>> On Friday, January 13, 2023 at 1:34:52 PM UTC+1 Brian Candler wrote:
>>>>
>>>>> Are the alertmanagers clustered? Then you should configure Prometheus 
>>>>> to deliver the alert to *all* alertmanagers.
>>>>>
>>>>> On Friday, 13 January 2023 at 11:08:37 UTC LukaszSz wrote:
>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I have Prometheus in HA mode: 4 nodes, with Prometheus + Alertmanager 
>>>>>> on each.
>>>>>> Everything works fine, but quite often an alert fires again even 
>>>>>> though the event has already been resolved by the Alertmanager.
>>>>>>
>>>>>> Below are logs from an example event (Chrony_Service_Down) recorded 
>>>>>> by the alertmanager:
>>>>>>
>>>>>>
>>>>>> ############################################################################################################
>>>>>> (1) Jan 10 10:48:48 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:48:48.746Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=Chrony_Service_Down[d8c020a][active]
>>>>>>
>>>>>> (2) Jan 10 10:49:19 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:49:19.460Z caller=nflog.go:553 level=debug component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\", datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"server.example.com\\\", instance=\\\"server.example.com\\\", job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\" integration:\"opsgenie\" > timestamp:<seconds:1673347759 nanos:262824014 > firing_alerts:10151928354614242630 > expires_at:<seconds:1673779759 nanos:262824014 > "
>>>>>>
>>>>>> (3) Jan 10 10:50:48 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:50:48.745Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=Chrony_Service_Down[d8c020a][resolved]
>>>>>>
>>>>>> (4) Jan 10 10:51:29 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:51:29.183Z caller=nflog.go:553 level=debug component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\", datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"server.example.com\\\", instance=\\\"server.example.com\\\", job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\" integration:\"opsgenie\" > timestamp:<seconds:1673347888 nanos:897562679 > resolved_alerts:10151928354614242630 > expires_at:<seconds:1673779888 nanos:897562679 > "
>>>>>>
>>>>>> (5) Jan 10 10:51:49 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:51:49.745Z caller=nflog.go:553 level=debug component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\", datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"server.example.com\\\", instance=\\\"server.example.com\\\", job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\" integration:\"opsgenie\" > timestamp:<seconds:1673347909 nanos:649205670 > firing_alerts:10151928354614242630 > expires_at:<seconds:1673779909 nanos:649205670 > "
>>>>>>
>>>>>> (6) Jan 10 10:51:59 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:51:59.312Z caller=nflog.go:553 level=debug component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\", datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"server.example.com\\\", instance=\\\"server.example.com\\\", job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\" integration:\"opsgenie\" > timestamp:<seconds:1673347919 nanos:137020780 > resolved_alerts:10151928354614242630 > expires_at:<seconds:1673779919 nanos:137020780 > "
>>>>>>
>>>>>> (7) Jan 10 10:54:58 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:54:58.744Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=Chrony_Service_Down[d8c020a][resolved]
>>>>>>
>>>>>> #############################################################################################################
>>>>>>
>>>>>> The interesting one is entry (5) (Jan 10 10:51:49), where the 
>>>>>> alertmanager fired the alert a second time even though a minute 
>>>>>> earlier (Jan 10 10:50:48) the alert had been marked as resolved.
>>>>>> This behavior generates duplicate alerts in our system, which is 
>>>>>> quite annoying at our scale.
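>>>>>>
>>>>>> To check whether gossip is keeping up between the peers, I can 
>>>>>> compare these two gauges across all instances (a sketch; names as 
>>>>>> exposed on each Alertmanager's /metrics endpoint):
>>>>>>
>>>>>> # every instance should report the full cluster size
>>>>>> alertmanager_cluster_members
>>>>>> # should be 0 everywhere; non-zero means a peer is unreachable
>>>>>> alertmanager_cluster_failed_peers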
>>>>>>
>>>>>> Worth mentioning:
>>>>>> - For test purposes the event is scraped by all 4 Prometheus 
>>>>>> servers (the default), but the alert rule is evaluated by only one 
>>>>>> Prometheus.
>>>>>> - The event occurred only once, so there is no flapping that might 
>>>>>> cause the alert to fire again.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>
