Thank you for the further clarification. I think the crux of my issue was (wrongly) assuming that the documentation was instructing me not to use a load balancer for HA/network-partitioning concerns only, and not that full Alertmanager cluster state isn't being gossiped. I may try to put up a PR on Monday to clarify this in the docs; that would have saved us a bit of time debugging.
On Saturday, December 4, 2021 at 7:52:21 PM UTC-5 [email protected] wrote:

> The technical reason for this admonition is in how the
> Prometheus-Alertmanager complex implements high-availability notifications.
>
> The design goal is to send a notification in all possible circumstances,
> and *if possible* only send one.
>
> By spraying alerts to the list of all Alertmanager instances, each of
> these *can* send the notification even if Alertmanager clustering is
> completely broken, for example due to network partitions, misconfiguration,
> or some Alertmanager instances being unable to send out the notification.
>
> Worst case, you get multiple notifications, one from each Alertmanager.
> Some downstream services, like PagerDuty, will do their own deduplication,
> so you may not even notice. In other cases, like Slack or email, you get
> multiple, but that's much better than none!
>
> Every time Prometheus evaluates an alert rule and finds it to be firing,
> it will send an event to every Alertmanager it knows about, with an endsAt
> time a few minutes into the future. As this goes on, the updated endsAt
> keeps being a few minutes away.
>
> Each Alertmanager individually will determine what notifications (firing
> or resolved) should be sent. When clustering works, Alertmanagers will
> communicate which notifications have already been sent, so you only get
> one of each in the happy case.
>
> If you add a load balancer, only one Alertmanager will know that this
> alert even happened, and if for some reason it can't reach you, you may
> never know there was a problem.
>
> This is somewhat mitigated in your case because Prometheus sends a new
> event on every rule evaluation cycle. Eventually, this will randomly reach
> every Alertmanager instance, but not necessarily in time to prevent the
> last event from timing out. These different timeouts are what you have
> observed as different endsAt times.
> So the underlying reason is, as you say, high availability and network
> partitioning. The architecture to achieve that, with Prometheus repeatedly
> sending short-term events, means that randomly load balancing these to only
> one of the Alertmanager instances will lead to weird effects, including
> spurious "resolved" notifications.
>
> /MR
>
> On Sat, Dec 4, 2021, 19:17 Brian Candler <[email protected]> wrote:
>
>> Just to note what it says here
>> <https://prometheus.io/docs/alerting/latest/alertmanager/#high-availa>:
>>
>> *It's important not to load balance traffic between Prometheus and its
>> Alertmanagers, but instead, point Prometheus to a list of all
>> Alertmanagers.*

-- 
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-users/22ae33ed-429c-4783-8aaa-44c749bf26abn%40googlegroups.com.
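For anyone finding this thread later: the setup the docs (and the reply above) describe is to list every Alertmanager instance directly in Prometheus's alerting configuration, with no load balancer in between. A minimal sketch of such a fragment, with made-up hostnames standing in for your own instances:

```yaml
# prometheus.yml (fragment): point Prometheus at every Alertmanager
# instance directly -- no load balancer VIP in front of them.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # hypothetical hostnames; substitute your own instances
            - alertmanager-1.example.com:9093
            - alertmanager-2.example.com:9093
            - alertmanager-3.example.com:9093
```

The Alertmanager instances then gossip notification state among themselves (peers are set via their --cluster.peer flags), so in the happy case only one of them sends each notification, and in the broken-clustering case you get duplicates rather than silence.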

