> (1) We would like to avoid such an architecture. In this scenario we keep
> one region without a local alertmanager, which means we could lose alerts
> if the connection is lost between that region and the regions where the
> alertmanager cluster is configured.
But if you've totally lost connectivity from this region, then even if you
try to send a message to PagerDuty or OpsGenie or whatever, won't that fail
too?
On Monday, 16 January 2023 at 14:32:12 UTC LukaszSz wrote:
> Hi ,
>
> (1) We would like to avoid such an architecture. In this scenario we keep
> one region without a local alertmanager, which means we could lose alerts
> if the connection is lost between that region and the regions where the
> alertmanager cluster is configured.
>
> (2) It looks very promising. Currently the one blocking point is the lack
> of a frontend where we can set silences. I saw your previous posts about
> Karma. We are going to test this direction.
>
> Our other ideas are:
>
> (3) Reduce the current AM cluster from 8 to 4 nodes (1 AM per region).
> (4) If (3) doesn't help, we want to tweak the gossip settings to improve
> communication between AM nodes. Does anyone have experience with gossip
> tuning and best practices for AM HA?
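For reference, gossip behaviour is controlled by Alertmanager's `--cluster.*` flags. A minimal sketch of idea (3), one node per region in a full mesh — hostnames are placeholders, and the timing flags are shown at their documented defaults, so verify them against your Alertmanager version before changing anything:

```shell
# Hypothetical region-A node; the other three nodes would use the same
# peer list. Timing flags are at their defaults; raise them cautiously
# for high-latency WAN links rather than guessing.
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-eu1:9094 \
  --cluster.peer=alertmanager-us1:9094 \
  --cluster.peer=alertmanager-ap1:9094 \
  --cluster.peer=alertmanager-ap2:9094 \
  --cluster.gossip-interval=200ms \
  --cluster.pushpull-interval=1m0s \
  --cluster.peer-timeout=15s
```

This is a config fragment, not a recommendation of specific values; the peer-timeout in particular interacts with notification deduplication, since each peer waits its position in the cluster times this timeout before notifying.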
>
> Thanks
>
>
> On Sunday, January 15, 2023 at 11:58:47 AM UTC+1 Brian Candler wrote:
>
>> I wouldn't have thought that a few hundred ms of latency would make any
>> difference.
>>
>> I am however worried about the gossiping. If this is one monster-sized
>> cluster, then each of the 8 nodes should be communicating with the other 7.
>>
>> I'd say this is a bad design. Either:
>>
>> 1. Have a single global alertmanager cluster, with 2 nodes - that will
>> give you excellent high availability for your alerting. (How often do you
>> expect two regions to go offline simultaneously?) Or 3 nodes if your
>> management absolutely insists on it. (But this isn't the sort of cluster
>> that needs to maintain a quorum.)
>>
>> Or:
>>
>> 2. Completely separate the regions. Have one alertmanager cluster in
>> region A, one cluster in region B, one cluster in region C. Have the
>> prometheus instances in region A only talking to the alertmanager instances
>> in region A, and so on. In this case, each region sends its alerts
>> completely independently.
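Option (2) sketched as a Prometheus alerting config for region A (hostnames are hypothetical); regions B and C would each list only their own local alertmanagers:

```yaml
# Region A Prometheus: talks only to region A's alertmanager instances,
# so each region alerts completely independently.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-a1:9093
            - alertmanager-a2:9093
```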
>>
>> There is little benefit in option (2) unless there are tight restrictions
>> on inter-region communication; it gives you a lot more stuff to manage. If
>> you need to go this route, then having a frontend like Karma or alerta.io
>> may be helpful.
>>
>> On Friday, 13 January 2023 at 14:53:14 UTC LukaszSz wrote:
>>
>>> Interesting. It seems that the alertmanagers are spread over 3 different
>>> regions (2x Asia, 2x USA, 4x Europe).
>>> Maybe there is a latency problem between them, such as latency in gossip
>>> messages?
>>>
>>> On Friday, January 13, 2023 at 3:28:57 PM UTC+1 Brian Candler wrote:
>>>
>>>> That's a lot of alertmanagers. Are they all fully meshed? (But I'd
>>>> say 2 or 3 would be better - spread over different regions)
>>>>
>>>> On Friday, 13 January 2023 at 14:16:27 UTC LukaszSz wrote:
>>>>
>>>>> Yes. The prometheus server is configured to communicate with all
>>>>> alertmanagers (sorry, there are 8 alertmanagers):
>>>>>
>>>>> alerting:
>>>>>   alert_relabel_configs:
>>>>>     - action: labeldrop
>>>>>       regex: "^prometheus_server$"
>>>>>   alertmanagers:
>>>>>     - static_configs:
>>>>>         - targets:
>>>>>             - alertmanager1:9093
>>>>>             - alertmanager2:9093
>>>>>             - alertmanager3:9093
>>>>>             - alertmanager4:9093
>>>>>             - alertmanager5:9093
>>>>>             - alertmanager6:9093
>>>>>             - alertmanager7:9093
>>>>>             - alertmanager8:9093
>>>>>
>>>>> On Friday, January 13, 2023 at 2:02:14 PM UTC+1 Brian Candler wrote:
>>>>>
>>>>>> Yes, but have you configured the prometheus (the one which has
>>>>>> alerting rules) to have all four alertmanagers as its destination?
>>>>>>
>>>>>> On Friday, 13 January 2023 at 12:55:49 UTC LukaszSz wrote:
>>>>>>
>>>>>>> Yes Brian. As I mentioned in my post, the Alertmanagers are in a
>>>>>>> cluster and this event is visible on my 4 alertmanagers.
>>>>>>> The problem I described is that alerts are firing twice, which
>>>>>>> generates duplicates.
>>>>>>>
>>>>>>> On Friday, January 13, 2023 at 1:34:52 PM UTC+1 Brian Candler wrote:
>>>>>>>
>>>>>>>> Are the alertmanagers clustered? Then you should configure
>>>>>>>> prometheus to deliver the alert to *all* alertmanagers.
>>>>>>>>
>>>>>>>> On Friday, 13 January 2023 at 11:08:37 UTC LukaszSz wrote:
>>>>>>>>
>>>>>>>>> Hi guys,
>>>>>>>>>
>>>>>>>>> I have Prometheus in HA mode - 4 nodes - Prometheus+Alertmanager
>>>>>>>>> on each.
>>>>>>>>> Everything works fine, but quite often an alert fires again even
>>>>>>>>> though the event has already been resolved by alertmanager.
>>>>>>>>>
>>>>>>>>> Below are logs from an example event (Chrony_Service_Down)
>>>>>>>>> recorded by alertmanager:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ############################################################################################################
>>>>>>>>> (1) Jan 10 10:48:48 prometheus-01 alertmanager[1213219]:
>>>>>>>>> ts=2023-01-10T10:48:48.746Z caller=dispatch.go:165 level=debug
>>>>>>>>> component=dispatcher msg="Received alert"
>>>>>>>>> alert=Chrony_Service_Down[d8c020a][active]
>>>>>>>>>
>>>>>>>>> (2) Jan 10 10:49:19 prometheus-01 alertmanager[1213219]:
>>>>>>>>> ts=2023-01-10T10:49:19.460Z caller=nflog.go:553 level=debug
>>>>>>>>> component=nflog
>>>>>>>>> msg="gossiping new entry"
>>>>>>>>> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
>>>>>>>>>
>>>>>>>>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>>>>>>>>> server.example.com\\\", instance=\\\"server.example.com\\\",
>>>>>>>>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\",
>>>>>>>>> puppet_certname=\\\"server.example.com\\\",
>>>>>>>>> service=\\\"chrony\\\", severity=\\\"page\\\"}\"
>>>>>>>>> receiver:<group_name:\"opsgenie\" integration:\"opsgenie\" >
>>>>>>>>> timestamp:<seconds:1673347759 nanos:262824014 >
>>>>>>>>> firing_alerts:10151928354614242630 > expires_at:<seconds:1673779759
>>>>>>>>> nanos:262824014 > "
>>>>>>>>>
>>>>>>>>> (3) Jan 10 10:50:48 prometheus-01 alertmanager[1213219]:
>>>>>>>>> ts=2023-01-10T10:50:48.745Z caller=dispatch.go:165 level=debug
>>>>>>>>> component=dispatcher msg="Received alert"
>>>>>>>>> alert=Chrony_Service_Down[d8c020a][resolved]
>>>>>>>>>
>>>>>>>>> (4) Jan 10 10:51:29 prometheus-01 alertmanager[1213219]:
>>>>>>>>> ts=2023-01-10T10:51:29.183Z caller=nflog.go:553 level=debug
>>>>>>>>> component=nflog
>>>>>>>>> msg="gossiping new entry"
>>>>>>>>> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
>>>>>>>>>
>>>>>>>>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>>>>>>>>> server.example.com\\\", instance=\\\"server.example.com\\\",
>>>>>>>>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\",
>>>>>>>>> puppet_certname=\\\"server.example.com\\\",
>>>>>>>>> service=\\\"chrony\\\", severity=\\\"page\\\"}\"
>>>>>>>>> receiver:<group_name:\"opsgenie\" integration:\"opsgenie\" >
>>>>>>>>> timestamp:<seconds:1673347888 nanos:897562679 >
>>>>>>>>> resolved_alerts:10151928354614242630 > expires_at:<seconds:1673779888
>>>>>>>>> nanos:897562679 > "
>>>>>>>>>
>>>>>>>>> (5) Jan 10 10:51:49 prometheus-01 alertmanager[1213219]:
>>>>>>>>> ts=2023-01-10T10:51:49.745Z caller=nflog.go:553 level=debug
>>>>>>>>> component=nflog
>>>>>>>>> msg="gossiping new entry"
>>>>>>>>> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
>>>>>>>>>
>>>>>>>>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>>>>>>>>> server.example.com\\\", instance=\\\"server.example.com\\\",
>>>>>>>>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\",
>>>>>>>>> puppet_certname=\\\"server.example.com\\\",
>>>>>>>>> service=\\\"chrony\\\", severity=\\\"page\\\"}\"
>>>>>>>>> receiver:<group_name:\"opsgenie\" integration:\"opsgenie\" >
>>>>>>>>> timestamp:<seconds:1673347909 nanos:649205670 >
>>>>>>>>> firing_alerts:10151928354614242630 > expires_at:<seconds:1673779909
>>>>>>>>> nanos:649205670 > "
>>>>>>>>>
>>>>>>>>> (6) Jan 10 10:51:59 prometheus-01 alertmanager[1213219]:
>>>>>>>>> ts=2023-01-10T10:51:59.312Z caller=nflog.go:553 level=debug
>>>>>>>>> component=nflog
>>>>>>>>> msg="gossiping new entry"
>>>>>>>>> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
>>>>>>>>>
>>>>>>>>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>>>>>>>>> server.example.com\\\", instance=\\\"server.example.com\\\",
>>>>>>>>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\",
>>>>>>>>> puppet_certname=\\\"server.example.com\\\",
>>>>>>>>> service=\\\"chrony\\\", severity=\\\"page\\\"}\"
>>>>>>>>> receiver:<group_name:\"opsgenie\" integration:\"opsgenie\" >
>>>>>>>>> timestamp:<seconds:1673347919 nanos:137020780 >
>>>>>>>>> resolved_alerts:10151928354614242630 > expires_at:<seconds:1673779919
>>>>>>>>> nanos:137020780 > "
>>>>>>>>>
>>>>>>>>> (7) Jan 10 10:54:58 prometheus-01 alertmanager[1213219]:
>>>>>>>>> ts=2023-01-10T10:54:58.744Z caller=dispatch.go:165 level=debug
>>>>>>>>> component=dispatcher msg="Received alert"
>>>>>>>>> alert=Chrony_Service_Down[d8c020a][resolved]
>>>>>>>>>
>>>>>>>>> #############################################################################################################
>>>>>>>>>
>>>>>>>>> The interesting part is entry (5) (Jan 10 10:51:49), where
>>>>>>>>> alertmanager fired the alert a second time even though a minute
>>>>>>>>> earlier (Jan 10 10:50:48) the alert had been marked as resolved.
>>>>>>>>> This behaviour generates duplicate alerts in our system, which is
>>>>>>>>> quite annoying at our scale.
>>>>>>>>>
>>>>>>>>> Worth mentioning:
>>>>>>>>> - For test purposes the event is scraped by 4 Prometheus servers
>>>>>>>>> (default), but the alert rule is evaluated by only one Prometheus.
>>>>>>>>> - The event occurred only once, so there is no flapping that might
>>>>>>>>> cause the alert to fire again.
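One diagnostic worth running for a duplicate-notification timeline like the one above is to confirm that every node actually sees the full cluster, since a partially formed mesh is a common cause of duplicates. A rough sketch using Alertmanager's v2 status endpoint (hostnames match the thread's examples; the `jq`-free `grep` is deliberately crude, adjust to taste):

```shell
# Query each node's /api/v2/status and pull out the peer names it knows
# about. Every node should report the same, complete set of peers; a node
# that lists fewer peers is gossiping outside the mesh and will notify
# independently, producing duplicates.
for am in alertmanager1 alertmanager2 alertmanager3 alertmanager4; do
  echo "== ${am} =="
  curl -s "http://${am}:9093/api/v2/status" | grep -o '"name":"[^"]*"'
done
```

This is a live diagnostic against running instances, not something reproducible here; the exact JSON shape should be checked against your Alertmanager version's API.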
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>
--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-users/387aca8b-10c1-4ebf-964b-2cc956150a87n%40googlegroups.com.