That's a lot of alertmanagers. Are they all fully meshed? (But I'd say 2
or 3 would be better - spread over different regions)
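
One rough way to check the mesh is to ask each instance which peers it sees. This is only a sketch: it assumes the Alertmanager v2 API is reachable over plain HTTP at /api/v2/status on the targets from your config, and that the response carries a cluster.peers list. The helper names (peer_names, mesh_report, looks_fully_meshed, fetch_status, main) are made up for illustration.

```python
from __future__ import annotations

import json
from urllib.request import urlopen

# Targets copied from the Prometheus config quoted below.
ALERTMANAGERS = [f"alertmanager{i}:9093" for i in range(1, 9)]


def peer_names(status: dict) -> set[str]:
    """Return the peer names found in one /api/v2/status response body."""
    cluster = status.get("cluster") or {}
    return {peer["name"] for peer in cluster.get("peers") or []}


def mesh_report(statuses: dict[str, dict]) -> dict[str, int]:
    """Map each instance to the number of peers it reports."""
    return {host: len(peer_names(body)) for host, body in statuses.items()}


def looks_fully_meshed(statuses: dict[str, dict]) -> bool:
    """True if every instance reports the same, non-empty set of peers."""
    views = [peer_names(body) for body in statuses.values()]
    return bool(views) and bool(views[0]) and all(v == views[0] for v in views)


def fetch_status(host: str, timeout: float = 5.0) -> dict:
    """GET http://<host>/api/v2/status and decode the JSON body (assumes no TLS/auth)."""
    with urlopen(f"http://{host}/api/v2/status", timeout=timeout) as resp:
        return json.load(resp)


def main() -> None:
    """Query every target and print what each one believes the cluster looks like."""
    statuses = {host: fetch_status(host) for host in ALERTMANAGERS}
    for host, count in mesh_report(statuses).items():
        print(f"{host}: {count} peers")
    print("fully meshed:", looks_fully_meshed(statuses))
```

Calling main() from a host that can reach all eight instances prints one line per alertmanager. If the peer counts differ between instances, the gossip mesh is probably partitioned, which would be worth ruling out before looking at anything else.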
On Friday, 13 January 2023 at 14:16:27 UTC LukaszSz wrote:
> Yes. The Prometheus server is configured to communicate with all
> alertmanagers (sorry, there are 8 alertmanagers):
>
> alerting:
>   alert_relabel_configs:
>   - action: labeldrop
>     regex: "^prometheus_server$"
>   alertmanagers:
>   - static_configs:
>     - targets:
>       - alertmanager1:9093
>       - alertmanager2:9093
>       - alertmanager3:9093
>       - alertmanager4:9093
>       - alertmanager5:9093
>       - alertmanager6:9093
>       - alertmanager7:9093
>       - alertmanager8:9093
>
> On Friday, January 13, 2023 at 2:02:14 PM UTC+1 Brian Candler wrote:
>
>> Yes, but have you configured the prometheus (the one which has alerting
>> rules) to have all four alertmanagers as its destination?
>>
>> On Friday, 13 January 2023 at 12:55:49 UTC LukaszSz wrote:
>>
>>> Yes Brian. As I mentioned in my post, the Alertmanagers are in a cluster and
>>> this event is visible on my 4 alertmanagers.
>>> The problem I described is that alerts are firing twice, which
>>> generates duplicates.
>>>
>>> On Friday, January 13, 2023 at 1:34:52 PM UTC+1 Brian Candler wrote:
>>>
>>>> Are the alertmanagers clustered? Then you should configure prometheus
>>>> to deliver the alert to *all* alertmanagers.
>>>>
>>>> On Friday, 13 January 2023 at 11:08:37 UTC LukaszSz wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> I have Prometheus in HA mode - 4 nodes - Prometheus+Alertmanager on
>>>>> each.
>>>>> Everything works fine, but very often I see an alert firing again
>>>>> even though the event has already been resolved by Alertmanager.
>>>>>
>>>>> Below are logs from an example event (Chrony_Service_Down) recorded by
>>>>> Alertmanager:
>>>>>
>>>>>
>>>>> ############################################################################################################
>>>>> (1) Jan 10 10:48:48 prometheus-01 alertmanager[1213219]:
>>>>> ts=2023-01-10T10:48:48.746Z caller=dispatch.go:165 level=debug
>>>>> component=dispatcher msg="Received alert"
>>>>> alert=Chrony_Service_Down[d8c020a][active]
>>>>>
>>>>> (2) Jan 10 10:49:19 prometheus-01 alertmanager[1213219]:
>>>>> ts=2023-01-10T10:49:19.460Z caller=nflog.go:553 level=debug
>>>>> component=nflog
>>>>> msg="gossiping new entry"
>>>>> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
>>>>>
>>>>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>>>>> server.example.com\\\", instance=\\\"server.example.com\\\",
>>>>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\",
>>>>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\",
>>>>> severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\"
>>>>> integration:\"opsgenie\" > timestamp:<seconds:1673347759 nanos:262824014
>>>>> >
>>>>> firing_alerts:10151928354614242630 > expires_at:<seconds:1673779759
>>>>> nanos:262824014 > "
>>>>>
>>>>> (3) Jan 10 10:50:48 prometheus-01 alertmanager[1213219]:
>>>>> ts=2023-01-10T10:50:48.745Z caller=dispatch.go:165 level=debug
>>>>> component=dispatcher msg="Received alert"
>>>>> alert=Chrony_Service_Down[d8c020a][resolved]
>>>>>
>>>>> (4) Jan 10 10:51:29 prometheus-01 alertmanager[1213219]:
>>>>> ts=2023-01-10T10:51:29.183Z caller=nflog.go:553 level=debug
>>>>> component=nflog
>>>>> msg="gossiping new entry"
>>>>> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
>>>>>
>>>>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>>>>> server.example.com\\\", instance=\\\"server.example.com\\\",
>>>>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\",
>>>>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\",
>>>>> severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\"
>>>>> integration:\"opsgenie\" > timestamp:<seconds:1673347888 nanos:897562679
>>>>> >
>>>>> resolved_alerts:10151928354614242630 > expires_at:<seconds:1673779888
>>>>> nanos:897562679 > "
>>>>>
>>>>> (5) Jan 10 10:51:49 prometheus-01 alertmanager[1213219]:
>>>>> ts=2023-01-10T10:51:49.745Z caller=nflog.go:553 level=debug
>>>>> component=nflog
>>>>> msg="gossiping new entry"
>>>>> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
>>>>>
>>>>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>>>>> server.example.com\\\", instance=\\\"server.example.com\\\",
>>>>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\",
>>>>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\",
>>>>> severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\"
>>>>> integration:\"opsgenie\" > timestamp:<seconds:1673347909 nanos:649205670
>>>>> >
>>>>> firing_alerts:10151928354614242630 > expires_at:<seconds:1673779909
>>>>> nanos:649205670 > "
>>>>>
>>>>> (6) Jan 10 10:51:59 prometheus-01 alertmanager[1213219]:
>>>>> ts=2023-01-10T10:51:59.312Z caller=nflog.go:553 level=debug
>>>>> component=nflog
>>>>> msg="gossiping new entry"
>>>>> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
>>>>>
>>>>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>>>>> server.example.com\\\", instance=\\\"server.example.com\\\",
>>>>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\",
>>>>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\",
>>>>> severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\"
>>>>> integration:\"opsgenie\" > timestamp:<seconds:1673347919 nanos:137020780
>>>>> >
>>>>> resolved_alerts:10151928354614242630 > expires_at:<seconds:1673779919
>>>>> nanos:137020780 > "
>>>>>
>>>>> (7) Jan 10 10:54:58 prometheus-01 alertmanager[1213219]:
>>>>> ts=2023-01-10T10:54:58.744Z caller=dispatch.go:165 level=debug
>>>>> component=dispatcher msg="Received alert"
>>>>> alert=Chrony_Service_Down[d8c020a][resolved]
>>>>>
>>>>> #############################################################################################################
>>>>>
>>>>> The interesting part is entry 5 (Jan 10 10:51:49), where Alertmanager
>>>>> fired the alert a second time even though a minute earlier
>>>>> (Jan 10 10:50:48) the alert had been marked as resolved.
>>>>> This behavior generates duplicate alerts in our system, which is quite
>>>>> annoying at our scale.
>>>>>
>>>>> Worth mentioning:
>>>>> - For test purposes the event is scraped by 4 Prometheus
>>>>> servers (default), but the alert rule is evaluated by one Prometheus.
>>>>> - The event occurs only once, so there is no flapping which might
>>>>> cause the alert to fire again.
>>>>>
>>>>> Thanks
>>>>>
>>>>