Are the Alertmanagers clustered? If so, you should configure Prometheus to
deliver alerts to *all* Alertmanagers, not just one of them. The notification
log that the cluster gossips can only deduplicate correctly when every
Alertmanager receives every alert directly from Prometheus; sending to a
single instance (or through a load balancer) leads to exactly this kind of
duplicate firing.
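
For illustration, a minimal alerting block in prometheus.yml that points one
Prometheus at all four Alertmanagers might look like this (the hostnames and
the default port 9093 are assumptions based on your node naming; adjust to
your environment):

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # List all four Alertmanagers explicitly,
            # not a load balancer in front of them.
            - 'prometheus-01:9093'
            - 'prometheus-02:9093'
            - 'prometheus-03:9093'
            - 'prometheus-04:9093'

Each Alertmanager also needs to know its peers (e.g. via --cluster.peer flags
pointing at the other three nodes) so the four instances form a single gossip
cluster rather than four independent ones.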
On Friday, 13 January 2023 at 11:08:37 UTC LukaszSz wrote:
> Hi guys,
>
> I have Prometheus in HA mode - 4 nodes - Prometheus+Alertmanager on each.
> Everything works fine, but quite often I see an alert fire again even
> though the event has already been resolved by Alertmanager.
>
> Below are logs from an example event (Chrony_Service_Down) recorded by
> Alertmanager:
>
>
> ############################################################################################################
> (1) Jan 10 10:48:48 prometheus-01 alertmanager[1213219]:
> ts=2023-01-10T10:48:48.746Z caller=dispatch.go:165 level=debug
> component=dispatcher msg="Received alert"
> alert=Chrony_Service_Down[d8c020a][active]
>
> (2) Jan 10 10:49:19 prometheus-01 alertmanager[1213219]:
> ts=2023-01-10T10:49:19.460Z caller=nflog.go:553 level=debug component=nflog
> msg="gossiping new entry"
> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
> server.example.com\\\", instance=\\\"server.example.com\\\",
> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\",
> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\",
> severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\"
> integration:\"opsgenie\" > timestamp:<seconds:1673347759 nanos:262824014 >
> firing_alerts:10151928354614242630 > expires_at:<seconds:1673779759
> nanos:262824014 > "
>
> (3) Jan 10 10:50:48 prometheus-01 alertmanager[1213219]:
> ts=2023-01-10T10:50:48.745Z caller=dispatch.go:165 level=debug
> component=dispatcher msg="Received alert"
> alert=Chrony_Service_Down[d8c020a][resolved]
>
> (4) Jan 10 10:51:29 prometheus-01 alertmanager[1213219]:
> ts=2023-01-10T10:51:29.183Z caller=nflog.go:553 level=debug component=nflog
> msg="gossiping new entry"
> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
> server.example.com\\\", instance=\\\"server.example.com\\\",
> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\",
> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\",
> severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\"
> integration:\"opsgenie\" > timestamp:<seconds:1673347888 nanos:897562679 >
> resolved_alerts:10151928354614242630 > expires_at:<seconds:1673779888
> nanos:897562679 > "
>
> (5) Jan 10 10:51:49 prometheus-01 alertmanager[1213219]:
> ts=2023-01-10T10:51:49.745Z caller=nflog.go:553 level=debug component=nflog
> msg="gossiping new entry"
> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
> server.example.com\\\", instance=\\\"server.example.com\\\",
> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\",
> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\",
> severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\"
> integration:\"opsgenie\" > timestamp:<seconds:1673347909 nanos:649205670 >
> firing_alerts:10151928354614242630 > expires_at:<seconds:1673779909
> nanos:649205670 > "
>
> (6) Jan 10 10:51:59 prometheus-01 alertmanager[1213219]:
> ts=2023-01-10T10:51:59.312Z caller=nflog.go:553 level=debug component=nflog
> msg="gossiping new entry"
> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
> server.example.com\\\", instance=\\\"server.example.com\\\",
> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\",
> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\",
> severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\"
> integration:\"opsgenie\" > timestamp:<seconds:1673347919 nanos:137020780 >
> resolved_alerts:10151928354614242630 > expires_at:<seconds:1673779919
> nanos:137020780 > "
>
> (7) Jan 10 10:54:58 prometheus-01 alertmanager[1213219]:
> ts=2023-01-10T10:54:58.744Z caller=dispatch.go:165 level=debug
> component=dispatcher msg="Received alert"
> alert=Chrony_Service_Down[d8c020a][resolved]
>
> #############################################################################################################
>
> The interesting part is entry (5) (Jan 10 10:51:49), where Alertmanager
> fired the alert a second time, even though a minute earlier (Jan 10
> 10:50:48) the alert had been marked as resolved.
> Such behavior generates duplicate alerts in our system, which is quite
> annoying at our scale.
>
> Worth mentioning:
> - For testing purposes the event is scraped by all 4 Prometheus servers
> (default), but the alert rule is evaluated by only one Prometheus.
> - The event occurred only once, so there is no flapping that could cause
> the alert to fire again.
>
> Thanks
>