Yes Brian. As I mentioned in my post, the Alertmanagers are clustered and 
this event is visible on all 4 of my Alertmanagers.
The problem I described is that alerts are firing twice, which generates 
duplicate notifications.
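
For reference, the peering is set up roughly like this (a minimal sketch; 
the listen address and peer hostnames are placeholders, not our exact 
values):

    # each Alertmanager lists the other nodes as --cluster.peer
    alertmanager \
      --config.file=/etc/alertmanager/alertmanager.yml \
      --cluster.listen-address=0.0.0.0:9094 \
      --cluster.peer=prometheus-02:9094 \
      --cluster.peer=prometheus-03:9094 \
      --cluster.peer=prometheus-04:9094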

On Friday, January 13, 2023 at 1:34:52 PM UTC+1 Brian Candler wrote:

> Are the alertmanagers clustered? If so, you should configure Prometheus to 
> deliver alerts to *all* of the alertmanagers.
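>
> Something like this in each node's prometheus.yml (a sketch only; the 
> hostnames are placeholders for your four nodes):
>
>     alerting:
>       alertmanagers:
>         - static_configs:
>             - targets:
>                 # every Alertmanager in the cluster, not just the local one
>                 - prometheus-01:9093
>                 - prometheus-02:9093
>                 - prometheus-03:9093
>                 - prometheus-04:9093
>
> With that in place every Alertmanager receives each alert, and the cluster 
> deduplicates the notifications itself via the gossiped notification log.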
>
> On Friday, 13 January 2023 at 11:08:37 UTC LukaszSz wrote:
>
>> Hi guys,
>>
>> I have Prometheus in HA mode - 4 nodes, with Prometheus+Alertmanager on 
>> each. Everything works fine, but quite often an alert fires again even 
>> though the event has already been resolved by Alertmanager.
>>
>> Below are the logs from an example event (Chrony_Service_Down) recorded 
>> by Alertmanager:
>>
>>
>> ############################################################################################################
>> (1) Jan 10 10:48:48 prometheus-01 alertmanager[1213219]:
>> ts=2023-01-10T10:48:48.746Z caller=dispatch.go:165 level=debug
>> component=dispatcher msg="Received alert"
>> alert=Chrony_Service_Down[d8c020a][active]
>>
>> (2) Jan 10 10:49:19 prometheus-01 alertmanager[1213219]:
>> ts=2023-01-10T10:49:19.460Z caller=nflog.go:553 level=debug component=nflog
>> msg="gossiping new entry"
>> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\",
>> fqdn=\\\"server.example.com\\\", instance=\\\"server.example.com\\\",
>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\",
>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\",
>> severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\"
>> integration:\"opsgenie\" > timestamp:<seconds:1673347759 nanos:262824014 >
>> firing_alerts:10151928354614242630 > expires_at:<seconds:1673779759
>> nanos:262824014 > "
>>
>> (3) Jan 10 10:50:48 prometheus-01 alertmanager[1213219]:
>> ts=2023-01-10T10:50:48.745Z caller=dispatch.go:165 level=debug
>> component=dispatcher msg="Received alert"
>> alert=Chrony_Service_Down[d8c020a][resolved]
>>
>> (4) Jan 10 10:51:29 prometheus-01 alertmanager[1213219]:
>> ts=2023-01-10T10:51:29.183Z caller=nflog.go:553 level=debug component=nflog
>> msg="gossiping new entry"
>> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\",
>> fqdn=\\\"server.example.com\\\", instance=\\\"server.example.com\\\",
>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\",
>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\",
>> severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\"
>> integration:\"opsgenie\" > timestamp:<seconds:1673347888 nanos:897562679 >
>> resolved_alerts:10151928354614242630 > expires_at:<seconds:1673779888
>> nanos:897562679 > "
>>
>> (5) Jan 10 10:51:49 prometheus-01 alertmanager[1213219]:
>> ts=2023-01-10T10:51:49.745Z caller=nflog.go:553 level=debug component=nflog
>> msg="gossiping new entry"
>> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\",
>> fqdn=\\\"server.example.com\\\", instance=\\\"server.example.com\\\",
>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\",
>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\",
>> severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\"
>> integration:\"opsgenie\" > timestamp:<seconds:1673347909 nanos:649205670 >
>> firing_alerts:10151928354614242630 > expires_at:<seconds:1673779909
>> nanos:649205670 > "
>>
>> (6) Jan 10 10:51:59 prometheus-01 alertmanager[1213219]:
>> ts=2023-01-10T10:51:59.312Z caller=nflog.go:553 level=debug component=nflog
>> msg="gossiping new entry"
>> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\",
>> fqdn=\\\"server.example.com\\\", instance=\\\"server.example.com\\\",
>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\",
>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\",
>> severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\"
>> integration:\"opsgenie\" > timestamp:<seconds:1673347919 nanos:137020780 >
>> resolved_alerts:10151928354614242630 > expires_at:<seconds:1673779919
>> nanos:137020780 > "
>>
>> (7) Jan 10 10:54:58 prometheus-01 alertmanager[1213219]:
>> ts=2023-01-10T10:54:58.744Z caller=dispatch.go:165 level=debug
>> component=dispatcher msg="Received alert"
>> alert=Chrony_Service_Down[d8c020a][resolved]
>>
>> #############################################################################################################
>>
>> The interesting one is entry (5) (Jan 10 10:51:49), where Alertmanager 
>> fired the alert a second time even though a minute earlier (Jan 10 
>> 10:50:48, entry (3)) the alert had been marked as resolved.
>> This behavior generates a duplicate alert in our system, which is quite 
>> annoying at our scale.
>>
>> What is worth mentioning:
>> - For test purposes the event is scraped by all 4 Prometheus servers 
>> (default), but the alert rule is evaluated by only one Prometheus.
>> - The event occurred only once, so there is no flapping that might cause 
>> the alert to fire again.
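>>
>> For reference, cross-node deduplication timing is governed by the cluster 
>> flags below (shown with their upstream defaults purely as an illustration, 
>> not necessarily the values on our nodes):
>>
>>     # wait for a higher-priority peer to send the notification first
>>     --cluster.peer-timeout=15s
>>     # how often notification-log/silence state is gossiped between peers
>>     --cluster.gossip-interval=200ms
>>     # interval for a full state sync between peers
>>     --cluster.pushpull-interval=1m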
>>
>> Thanks
>>
>
