The technical reason for this admonition is in how the
Prometheus-Alertmanager complex implements high availability notifications.

The design goal is to send a notification in all possible circumstances,
and *if possible* only send one.

By spraying alerts to the full list of Alertmanager instances, each of
them *can* send the notification even if Alertmanager clustering is
completely broken, for example due to network partitions, misconfiguration,
or some instances being unable to send notifications at all.

Worst case, you get multiple notifications, one from each Alertmanager.
Some downstream services, like PagerDuty, do their own deduplication,
so you may not even notice. With others, like Slack or email, you get
duplicates, but duplicates are much better than no notification at all!

Every time Prometheus evaluates an alert rule and finds it firing, it
sends an event to every Alertmanager it knows about, with an endsAt time
a few minutes into the future. As long as the alert keeps firing, each
resend pushes the endsAt a few minutes further out.
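The refresh mechanic can be sketched in a few lines of Python. The alert
name, timestamps, and the exact lead time on endsAt are illustrative
assumptions (Prometheus's --rules.alert.resend-delay defaults to 1m; "a few
minutes ahead" is modelled here as four resend intervals):

```python
from datetime import datetime, timedelta, timezone

# Illustrative values: the resend delay defaults to 1m in Prometheus;
# the exact endsAt lead time is an assumption ("a few minutes ahead").
RESEND_DELAY = timedelta(minutes=1)
ENDS_AT_LEAD = 4 * RESEND_DELAY

def alert_event(started_at, now):
    """Sketch of the event Prometheus re-sends on every evaluation cycle."""
    return {
        "labels": {"alertname": "HighLatency"},  # hypothetical alert
        "startsAt": started_at.isoformat(),
        "endsAt": (now + ENDS_AT_LEAD).isoformat(),
    }

t0 = datetime(2021, 12, 4, 19, 0, tzinfo=timezone.utc)
first = alert_event(t0, t0)
later = alert_event(t0, t0 + RESEND_DELAY)  # next evaluation cycle

# Each resend pushes endsAt further into the future. If an Alertmanager
# stops receiving these refreshes, the last endsAt it saw eventually
# passes, and the alert looks resolved to that instance.
print(first["endsAt"], "->", later["endsAt"])
```

The key point is in the last comment: an Alertmanager that stops seeing
refreshes will let the alert expire on its own.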

Each Alertmanager individually determines which notifications (firing or
resolved) should be sent. When clustering works, the Alertmanagers tell
each other which notifications have already been sent, so in the happy
case you only get one of each.
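That cross-instance deduplication relies on the Alertmanagers being able to
gossip with each other, which is set up by pointing each instance at its
peers. A rough sketch of one instance's invocation, with hypothetical
addresses:

```
# Each Alertmanager joins the cluster by naming its peers.
# Addresses are made up; 9094 is the default cluster port.
alertmanager \
  --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-1.example.com:9094 \
  --cluster.peer=alertmanager-2.example.com:9094
```

If that gossip is broken, each instance falls back to notifying on its own,
which is the multiple-notifications worst case above.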

If you add a load balancer, each event reaches only one Alertmanager, and
if for some reason that one can't notify you, you may never know there was
a problem.

This is somewhat mitigated in your case because Prometheus sends a new
event on every rule evaluation cycle. Eventually these events will randomly
reach every Alertmanager instance, but not necessarily before the previous
event's endsAt expires on a given instance. Those staggered expiries are
what you have observed as different endsAt times.

So the underlying reason is as you say: high availability and network
partitioning. The architecture that achieves it, with Prometheus repeatedly
sending short-lived events, means that randomly load balancing them to only
one Alertmanager instance at a time leads to weird effects, including
spurious "resolved" notifications.

/MR


On Sat, Dec 4, 2021, 19:17 Brian Candler <[email protected]> wrote:

> Just to note what it says here
> <https://prometheus.io/docs/alerting/latest/alertmanager/#high-availa>:
>
> *It's important not to load balance traffic between Prometheus and its
> Alertmanagers, but instead, point Prometheus to a list of all
> Alertmanagers.*
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/CAMV%3D_ga-NhnSUCjuZE0-rjZo1%3DqEn-TDdYNXDgujfS%2B7221%3Dkw%40mail.gmail.com.
