>But if you've totally lost connectivity from this region, then even if you 
try to send a message to PagerDuty or OpsGenie or whatever, won't that fail 
too?
That is true. 

Nevertheless, what I have done so far is reduce the number of nodes in the 
cluster from 8 to 4 - we now have one alertmanager node per region. After 
one week, no duplication has been observed. I will keep this config for the 
next few weeks.

On Monday, January 16, 2023 at 7:19:37 PM UTC+1 Brian Candler wrote:

> > (1) We would like to avoid such an architecture. In this scenario we keep 
> one region without a local Alertmanager, which means we could lose alerts 
> if the connection is lost between that region and the regions where the 
> Alertmanager cluster is configured.
>
> But if you've totally lost connectivity from this region, then even if you 
> try to send a message to PagerDuty or OpsGenie or whatever, won't that fail 
> too?
>
> On Monday, 16 January 2023 at 14:32:12 UTC LukaszSz wrote:
>
>> Hi ,
>>
>> (1) We would like to avoid such an architecture. In this scenario we keep 
>> one region without a local Alertmanager, which means we could lose alerts 
>> if the connection is lost between that region and the regions where the 
>> Alertmanager cluster is configured.
>>
>> (2) It looks very promising. Currently the one blocking point is the lack 
>> of a frontend where we can set silences. I saw your previous posts about 
>> Karma. We are going to test this direction.
>>
>> Our other ideas are:
>>
>> (3) Reduce the current AM cluster from 8 to 4 nodes (1 AM per region). 
>> (4) If (3) does not help, we want to tweak/play with gossip to improve 
>> communication between the AM nodes. Does anyone have experience with 
>> gossip and best practices for AM HA?
>>
>> Thanks 
>>
>>
>> On Sunday, January 15, 2023 at 11:58:47 AM UTC+1 Brian Candler wrote:
>>
>>> I wouldn't have thought that a few hundred ms of latency would make any 
>>> difference.
>>>
>>> I am however worried about the gossiping.  If this is one monster-sized 
>>> cluster, then all 8 nodes should be communicating with every other 7 nodes.
>>>
>>> I'd say this is a bad design.  Either:
>>>
>>> 1. Have a single global alertmanager cluster, with 2 nodes - that will 
>>> give you excellent high availability for your alerting.  (How often do 
>>> you expect two regions to go offline simultaneously?)  Or 3 nodes if your 
>>> management absolutely insists on it.  (But this isn't the sort of cluster 
>>> that needs to maintain a quorum).
>>>
>>> Or:
>>>
>>> 2. Completely separate the regions.  Have one alertmanager cluster in 
>>> region A, one cluster in region B, one cluster in region C.  Have the 
>>> prometheus instances in region A only talking to the alertmanager instances 
>>> in region A, and so on.  In this case, each region sends its alerts 
>>> completely independently.
>>>
>>> There is little benefit in option (2) unless there are tight 
>>> restrictions on inter-region communication; it gives you a lot more stuff 
>>> to manage.  If you need to go this route, then having a frontend like Karma 
>>> or alerta.io may be helpful.
>>>
>>> On Friday, 13 January 2023 at 14:53:14 UTC LukaszSz wrote:
>>>
>>>> Interesting. It seems that the alertmanagers are spread over 3 different 
>>>> regions (2x Asia, 2x USA, 4x Europe).
>>>> Maybe there is a latency problem between them, such as latency in gossip 
>>>> messages?
>>>>
>>>> On Friday, January 13, 2023 at 3:28:57 PM UTC+1 Brian Candler wrote:
>>>>
>>>>> That's a lot of alertmanagers.  Are they all fully meshed?  (But I'd 
>>>>> say 2 or 3 would be better - spread over different regions)
>>>>>
>>>>> On Friday, 13 January 2023 at 14:16:27 UTC LukaszSz wrote:
>>>>>
>>>>>> Yes. The prometheus server is configured to communicate with all 
>>>>>> alertmanagers (sorry, there are 8 alertmanagers):
>>>>>>
>>>>>> alerting:
>>>>>>   alert_relabel_configs:
>>>>>>   - action: labeldrop
>>>>>>     regex: "^prometheus_server$"
>>>>>>   alertmanagers:
>>>>>>   - static_configs:
>>>>>>     - targets:
>>>>>>       - alertmanager1:9093
>>>>>>       - alertmanager2:9093
>>>>>>       - alertmanager3:9093
>>>>>>       - alertmanager4:9093
>>>>>>       - alertmanager5:9093
>>>>>>       - alertmanager6:9093
>>>>>>       - alertmanager7:9093
>>>>>>       - alertmanager8:9093 
>>>>>>
>>>>>> On Friday, January 13, 2023 at 2:02:14 PM UTC+1 Brian Candler wrote:
>>>>>>
>>>>>>> Yes, but have you configured the prometheus (the one which has 
>>>>>>> alerting rules) to have all four alertmanagers as its destination?
>>>>>>>
>>>>>>> On Friday, 13 January 2023 at 12:55:49 UTC LukaszSz wrote:
>>>>>>>
>>>>>>>> Yes Brian. As I mentioned in my post, the Alertmanagers are in a 
>>>>>>>> cluster and this event is visible on all 4 of my alertmanagers.
>>>>>>>> The problem I described is that alerts fire twice, which generates 
>>>>>>>> duplicates. 
>>>>>>>>
>>>>>>>> On Friday, January 13, 2023 at 1:34:52 PM UTC+1 Brian Candler wrote:
>>>>>>>>
>>>>>>>>> Are the alertmanagers clustered?  Then you should configure 
>>>>>>>>> prometheus to deliver the alert to *all* alertmanagers.
>>>>>>>>>
>>>>>>>>> On Friday, 13 January 2023 at 11:08:37 UTC LukaszSz wrote:
>>>>>>>>>
>>>>>>>>>> Hi guys,
>>>>>>>>>>
>>>>>>>>>> I have Prometheus in HA mode - 4 nodes - Prometheus+Alertmanager 
>>>>>>>>>> on each. 
>>>>>>>>>> Everything works fine, but very often I see an alert firing again 
>>>>>>>>>> even though the event has already been resolved by 
>>>>>>>>>> alertmanager.
>>>>>>>>>>
>>>>>>>>>> Below are logs from an example event (Chrony_Service_Down) 
>>>>>>>>>> recorded by alertmanager:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ############################################################################################################
>>>>>>>>>> (1) Jan 10 10:48:48 prometheus-01 alertmanager[1213219]: 
>>>>>>>>>> ts=2023-01-10T10:48:48.746Z caller=dispatch.go:165 level=debug 
>>>>>>>>>> component=dispatcher msg="Received alert" 
>>>>>>>>>> alert=Chrony_Service_Down[d8c020a][active]
>>>>>>>>>>
>>>>>>>>>> (2) Jan 10 10:49:19 prometheus-01 alertmanager[1213219]: 
>>>>>>>>>> ts=2023-01-10T10:49:19.460Z caller=nflog.go:553 level=debug 
>>>>>>>>>> component=nflog 
>>>>>>>>>> msg="gossiping new entry" 
>>>>>>>>>> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
>>>>>>>>>>  
>>>>>>>>>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>>>>>>>>>> server.example.com\\\", instance=\\\"server.example.com\\\", 
>>>>>>>>>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
>>>>>>>>>> puppet_certname=\\\"server.example.com\\\", 
>>>>>>>>>> service=\\\"chrony\\\", severity=\\\"page\\\"}\" 
>>>>>>>>>> receiver:<group_name:\"opsgenie\" integration:\"opsgenie\" > 
>>>>>>>>>> timestamp:<seconds:1673347759 nanos:262824014 > 
>>>>>>>>>> firing_alerts:10151928354614242630 > expires_at:<seconds:1673779759 
>>>>>>>>>> nanos:262824014 > "
>>>>>>>>>>
>>>>>>>>>> (3) Jan 10 10:50:48 prometheus-01 alertmanager[1213219]: 
>>>>>>>>>> ts=2023-01-10T10:50:48.745Z caller=dispatch.go:165 level=debug 
>>>>>>>>>> component=dispatcher msg="Received alert" 
>>>>>>>>>> alert=Chrony_Service_Down[d8c020a][resolved]
>>>>>>>>>>
>>>>>>>>>> (4) Jan 10 10:51:29 prometheus-01 alertmanager[1213219]: 
>>>>>>>>>> ts=2023-01-10T10:51:29.183Z caller=nflog.go:553 level=debug 
>>>>>>>>>> component=nflog 
>>>>>>>>>> msg="gossiping new entry" 
>>>>>>>>>> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
>>>>>>>>>>  
>>>>>>>>>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>>>>>>>>>> server.example.com\\\", instance=\\\"server.example.com\\\", 
>>>>>>>>>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
>>>>>>>>>> puppet_certname=\\\"server.example.com\\\", 
>>>>>>>>>> service=\\\"chrony\\\", severity=\\\"page\\\"}\" 
>>>>>>>>>> receiver:<group_name:\"opsgenie\" integration:\"opsgenie\" > 
>>>>>>>>>> timestamp:<seconds:1673347888 nanos:897562679 > 
>>>>>>>>>> resolved_alerts:10151928354614242630 > 
>>>>>>>>>> expires_at:<seconds:1673779888 
>>>>>>>>>> nanos:897562679 > "
>>>>>>>>>>
>>>>>>>>>> (5) Jan 10 10:51:49 prometheus-01 alertmanager[1213219]: 
>>>>>>>>>> ts=2023-01-10T10:51:49.745Z caller=nflog.go:553 level=debug 
>>>>>>>>>> component=nflog 
>>>>>>>>>> msg="gossiping new entry" 
>>>>>>>>>> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
>>>>>>>>>>  
>>>>>>>>>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>>>>>>>>>> server.example.com\\\", instance=\\\"server.example.com\\\", 
>>>>>>>>>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
>>>>>>>>>> puppet_certname=\\\"server.example.com\\\", 
>>>>>>>>>> service=\\\"chrony\\\", severity=\\\"page\\\"}\" 
>>>>>>>>>> receiver:<group_name:\"opsgenie\" integration:\"opsgenie\" > 
>>>>>>>>>> timestamp:<seconds:1673347909 nanos:649205670 > 
>>>>>>>>>> firing_alerts:10151928354614242630 > expires_at:<seconds:1673779909 
>>>>>>>>>> nanos:649205670 > "
>>>>>>>>>>
>>>>>>>>>> (6) Jan 10 10:51:59 prometheus-01 alertmanager[1213219]: 
>>>>>>>>>> ts=2023-01-10T10:51:59.312Z caller=nflog.go:553 level=debug 
>>>>>>>>>> component=nflog 
>>>>>>>>>> msg="gossiping new entry" 
>>>>>>>>>> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
>>>>>>>>>>  
>>>>>>>>>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>>>>>>>>>> server.example.com\\\", instance=\\\"server.example.com\\\", 
>>>>>>>>>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
>>>>>>>>>> puppet_certname=\\\"server.example.com\\\", 
>>>>>>>>>> service=\\\"chrony\\\", severity=\\\"page\\\"}\" 
>>>>>>>>>> receiver:<group_name:\"opsgenie\" integration:\"opsgenie\" > 
>>>>>>>>>> timestamp:<seconds:1673347919 nanos:137020780 > 
>>>>>>>>>> resolved_alerts:10151928354614242630 > 
>>>>>>>>>> expires_at:<seconds:1673779919 
>>>>>>>>>> nanos:137020780 > "
>>>>>>>>>>
>>>>>>>>>> (7) Jan 10 10:54:58 prometheus-01 alertmanager[1213219]: 
>>>>>>>>>> ts=2023-01-10T10:54:58.744Z caller=dispatch.go:165 level=debug 
>>>>>>>>>> component=dispatcher msg="Received alert" 
>>>>>>>>>> alert=Chrony_Service_Down[d8c020a][resolved]
>>>>>>>>>>
>>>>>>>>>> #############################################################################################################
>>>>>>>>>>
>>>>>>>>>> The interesting part is entry (5) (Jan 10 10:51:49), where 
>>>>>>>>>> alertmanager fired the alert a second time even though a minute 
>>>>>>>>>> earlier (Jan 10 10:50:48) the alert was marked as resolved.
>>>>>>>>>> This behavior generates duplicate alerts in our system, which is 
>>>>>>>>>> quite annoying at our scale.
>>>>>>>>>>
>>>>>>>>>> Worth mentioning:
>>>>>>>>>> - For test purposes the event is scraped by 4 Prometheus servers 
>>>>>>>>>> (default), but the alert rule is evaluated by only one Prometheus.
>>>>>>>>>> - The event occurred only once, so there is no flapping that 
>>>>>>>>>> might cause another alert to fire.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/e08cfbb1-357b-4c29-a417-9915f3d7a4afn%40googlegroups.com.
