I wouldn't have thought that a few hundred ms of latency would make any
difference.
I am, however, worried about the gossiping. If this is one monster-sized
cluster, then each of the 8 nodes will be gossiping with the other 7.
I'd say this is a bad design. Either:
1. Have a single global alertmanager cluster with 2 nodes - that will give
you excellent high availability for your alerting. (How often do you
expect two regions to go offline simultaneously?) Or 3 nodes if your
management absolutely insists on it. (But this isn't the sort of cluster
that needs to maintain a quorum.)
Or:
2. Completely separate the regions. Have one alertmanager cluster in
region A, one cluster in region B, one cluster in region C. Have the
prometheus instances in region A only talking to the alertmanager instances
in region A, and so on. In this case, each region sends its alerts
completely independently.
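For option (1), a minimal sketch of the clustering flags - the hostnames
am-a and am-b and the paths here are hypothetical, not taken from your
setup:

```shell
# On am-a (hostname is an assumption for illustration):
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=am-b:9094

# On am-b, point back the other way:
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=am-a:9094
```

Every prometheus, in every region, would then list both am-a and am-b
under "alertmanagers:" in its config, so each alert is delivered to both
nodes and the pair deduplicates notifications via gossip.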
There is little benefit in option (2) unless there are tight restrictions
on inter-region communication; it gives you a lot more to manage. If
you need to go this route, then having a frontend like Karma or alerta.io
may be helpful.
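If you do go with option (2), the per-region wiring is just each region's
prometheus listing only its local alertmanagers. A sketch for region A,
with hypothetical hostnames:

```yaml
# prometheus.yml on the region-A prometheus servers (hostnames are
# assumptions for illustration)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-a1:9093
            - alertmanager-a2:9093
```

Here alertmanager-a1 and alertmanager-a2 would be clustered with each
other, but not with the region B or region C nodes.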
On Friday, 13 January 2023 at 14:53:14 UTC LukaszSz wrote:
> Interesting. It seems that the alertmanagers are spread over 3 different
> regions (2x Asia, 2x USA, 4x Europe).
> Maybe there is some latency problem between them, such as delays in
> gossip messages?
>
> On Friday, January 13, 2023 at 3:28:57 PM UTC+1 Brian Candler wrote:
>
>> That's a lot of alertmanagers. Are they all fully meshed? (I'd say 2
>> or 3 would be better, spread over different regions.)
>>
>> On Friday, 13 January 2023 at 14:16:27 UTC LukaszSz wrote:
>>
>>> Yes. The prometheus server is configured to communicate with all
>>> alertmanagers (sorry, there are 8 alertmanagers):
>>>
>>> alerting:
>>>   alert_relabel_configs:
>>>     - action: labeldrop
>>>       regex: "^prometheus_server$"
>>>   alertmanagers:
>>>     - static_configs:
>>>         - targets:
>>>             - alertmanager1:9093
>>>             - alertmanager2:9093
>>>             - alertmanager3:9093
>>>             - alertmanager4:9093
>>>             - alertmanager5:9093
>>>             - alertmanager6:9093
>>>             - alertmanager7:9093
>>>             - alertmanager8:9093
>>>
>>> On Friday, January 13, 2023 at 2:02:14 PM UTC+1 Brian Candler wrote:
>>>
>>>> Yes, but have you configured the prometheus (the one which has alerting
>>>> rules) to have all four alertmanagers as its destination?
>>>>
>>>> On Friday, 13 January 2023 at 12:55:49 UTC LukaszSz wrote:
>>>>
>>>>> Yes Brian. As I mentioned in my post, the Alertmanagers are in a
>>>>> cluster and this event is visible on all 4 of my alertmanagers.
>>>>> The problem I described is that alerts are firing twice, which
>>>>> generates duplication.
>>>>>
>>>>> On Friday, January 13, 2023 at 1:34:52 PM UTC+1 Brian Candler wrote:
>>>>>
>>>>>> Are the alertmanagers clustered? Then you should configure
>>>>>> prometheus to deliver the alert to *all* alertmanagers.
>>>>>>
>>>>>> On Friday, 13 January 2023 at 11:08:37 UTC LukaszSz wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>>
>>>>>>> I have Prometheus in HA mode - 4 nodes - Prometheus + Alertmanager
>>>>>>> on each.
>>>>>>> Everything works fine, but quite often an alert fires again even
>>>>>>> though the event has already been resolved by alertmanager.
>>>>>>>
>>>>>>> Below are the logs from an example event (Chrony_Service_Down)
>>>>>>> recorded by alertmanager:
>>>>>>>
>>>>>>>
>>>>>>> ############################################################################################################
>>>>>>> (1) Jan 10 10:48:48 prometheus-01 alertmanager[1213219]:
>>>>>>> ts=2023-01-10T10:48:48.746Z caller=dispatch.go:165 level=debug
>>>>>>> component=dispatcher msg="Received alert"
>>>>>>> alert=Chrony_Service_Down[d8c020a][active]
>>>>>>>
>>>>>>> (2) Jan 10 10:49:19 prometheus-01 alertmanager[1213219]:
>>>>>>> ts=2023-01-10T10:49:19.460Z caller=nflog.go:553 level=debug
>>>>>>> component=nflog
>>>>>>> msg="gossiping new entry"
>>>>>>> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
>>>>>>>
>>>>>>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>>>>>>> server.example.com\\\", instance=\\\"server.example.com\\\",
>>>>>>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\",
>>>>>>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\",
>>>>>>> severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\"
>>>>>>> integration:\"opsgenie\" > timestamp:<seconds:1673347759
>>>>>>> nanos:262824014 >
>>>>>>> firing_alerts:10151928354614242630 > expires_at:<seconds:1673779759
>>>>>>> nanos:262824014 > "
>>>>>>>
>>>>>>> (3) Jan 10 10:50:48 prometheus-01 alertmanager[1213219]:
>>>>>>> ts=2023-01-10T10:50:48.745Z caller=dispatch.go:165 level=debug
>>>>>>> component=dispatcher msg="Received alert"
>>>>>>> alert=Chrony_Service_Down[d8c020a][resolved]
>>>>>>>
>>>>>>> (4) Jan 10 10:51:29 prometheus-01 alertmanager[1213219]:
>>>>>>> ts=2023-01-10T10:51:29.183Z caller=nflog.go:553 level=debug
>>>>>>> component=nflog
>>>>>>> msg="gossiping new entry"
>>>>>>> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
>>>>>>>
>>>>>>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>>>>>>> server.example.com\\\", instance=\\\"server.example.com\\\",
>>>>>>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\",
>>>>>>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\",
>>>>>>> severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\"
>>>>>>> integration:\"opsgenie\" > timestamp:<seconds:1673347888
>>>>>>> nanos:897562679 >
>>>>>>> resolved_alerts:10151928354614242630 > expires_at:<seconds:1673779888
>>>>>>> nanos:897562679 > "
>>>>>>>
>>>>>>> (5) Jan 10 10:51:49 prometheus-01 alertmanager[1213219]:
>>>>>>> ts=2023-01-10T10:51:49.745Z caller=nflog.go:553 level=debug
>>>>>>> component=nflog
>>>>>>> msg="gossiping new entry"
>>>>>>> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
>>>>>>>
>>>>>>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>>>>>>> server.example.com\\\", instance=\\\"server.example.com\\\",
>>>>>>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\",
>>>>>>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\",
>>>>>>> severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\"
>>>>>>> integration:\"opsgenie\" > timestamp:<seconds:1673347909
>>>>>>> nanos:649205670 >
>>>>>>> firing_alerts:10151928354614242630 > expires_at:<seconds:1673779909
>>>>>>> nanos:649205670 > "
>>>>>>>
>>>>>>> (6) Jan 10 10:51:59 prometheus-01 alertmanager[1213219]:
>>>>>>> ts=2023-01-10T10:51:59.312Z caller=nflog.go:553 level=debug
>>>>>>> component=nflog
>>>>>>> msg="gossiping new entry"
>>>>>>> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
>>>>>>>
>>>>>>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>>>>>>> server.example.com\\\", instance=\\\"server.example.com\\\",
>>>>>>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\",
>>>>>>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\",
>>>>>>> severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\"
>>>>>>> integration:\"opsgenie\" > timestamp:<seconds:1673347919
>>>>>>> nanos:137020780 >
>>>>>>> resolved_alerts:10151928354614242630 > expires_at:<seconds:1673779919
>>>>>>> nanos:137020780 > "
>>>>>>>
>>>>>>> (7) Jan 10 10:54:58 prometheus-01 alertmanager[1213219]:
>>>>>>> ts=2023-01-10T10:54:58.744Z caller=dispatch.go:165 level=debug
>>>>>>> component=dispatcher msg="Received alert"
>>>>>>> alert=Chrony_Service_Down[d8c020a][resolved]
>>>>>>>
>>>>>>> #############################################################################################################
>>>>>>>
>>>>>>> The interesting one is entry (5) (Jan 10 10:51:49), where
>>>>>>> alertmanager fired the alert a second time, even though a minute
>>>>>>> earlier (Jan 10 10:50:48) the alert had been marked as resolved.
>>>>>>> Such behavior generates duplicate alerts in our system, which is
>>>>>>> quite annoying at our scale.
>>>>>>>
>>>>>>> What is worth mentioning:
>>>>>>> - For test purposes, the event is scraped by 4 Prometheus
>>>>>>> servers (default), but the alert rule is evaluated by only one
>>>>>>> Prometheus.
>>>>>>> - The event occurs only once, so there is no flapping which might
>>>>>>> cause another alert to fire.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>
--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-users/f68e61cc-c5b2-4d43-98a3-0c9514faee5cn%40googlegroups.com.