Is there a metric from which you can determine whether a particular
instance has been "intentionally shut down"? If so, you can use a join
between the metrics in your PromQL alert, e.g.:
expr: increase(foo[5m]) < 1 unless on (instance) adminShutdown == 1
(Aside: this is not a boolean expression. and/or/unless are set
intersection, union and difference operators
<https://prometheus.io/docs/prometheus/latest/querying/operators/#logical-set-binary-operators>.
The LHS is a vector of potential alerts; the RHS is also a vector, filtered
down to only those timeseries where the value is 1; the "unless" operator
suppresses every element of the LHS that has a matching set of labels in
the RHS - in this case matching on only the "instance" label, because
that's what the "on" clause specifies.)
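As a concrete sketch of that matching (metric names and values are made up
for illustration):

```
# LHS: increase(foo[5m]) < 1 returns:
#   {instance="bar"} 0
#   {instance="baz"} 0
# RHS: adminShutdown == 1 returns:
#   adminShutdown{instance="bar"} 1
# LHS unless on (instance) RHS returns:
#   {instance="baz"} 0
```

So only "baz" still alerts: "bar" is suppressed because a series with
instance="bar" exists on the RHS.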
If you don't already have a metric you can use for this, then maybe you
need to create one. This could be done on each target - for example using
the node_exporter textfile collector, drop a file like this into the
collector directory:
adminShutdown 0
Scraping will add the 'instance' label for you.
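A minimal sketch of writing that file from a shutdown script (the directory
path and file name are hypothetical - use whatever you pass to
node_exporter's --collector.textfile.directory). Writing to a temporary
name and renaming means a scrape never sees a half-written file:

```shell
# Stand-in for the real textfile collector directory:
TEXTFILE_DIR="$(mktemp -d)"

# Write the metric to a temp file, then rename it atomically so the
# collector only ever reads a complete .prom file:
printf 'adminShutdown 0\n' > "$TEXTFILE_DIR/admin_shutdown.prom.$$"
mv "$TEXTFILE_DIR/admin_shutdown.prom.$$" "$TEXTFILE_DIR/admin_shutdown.prom"

cat "$TEXTFILE_DIR/admin_shutdown.prom"
```

Flip the value to 1 in your shutdown procedure and back to 0 on startup.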
Or globally - e.g. create a list of metrics describing the state of each
instance, serve it from an HTTP server, and scrape it in its own scrape job
with "honor_labels: true" to prevent the instance labels being overridden:
adminShutdown{instance="foo"} 0
adminShutdown{instance="bar"} 1
adminShutdown{instance="baz"} 0
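A sketch of that scrape job (the job name, host and port are made up -
substitute wherever you actually serve the file):

```yaml
scrape_configs:
  - job_name: instance_state          # hypothetical job name
    honor_labels: true                # keep instance="..." from the scraped body
    static_configs:
      - targets: ['statehost.example:8080']  # hypothetical host serving the file
```

Without honor_labels, Prometheus would rewrite each instance label to the
address of the target it scraped, which is not what you want here.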
Don't worry about the few extra timeseries this will create. Prometheus
compresses timeseries extremely well, especially where scrapes give
repeated identical values.
On Wednesday 6 December 2023 at 10:05:54 UTC Tim B. wrote:
> Hello everyone,
>
> I'm relatively new to Prometheus, so your patience is much appreciated.
>
> I'm facing an issue and seeking guidance:
>
> I'm working with a metric like CPU usage, where instance identifiers are
> submitted as labels. To ensure instances are running as expected, I've
> defined an alert based on this metric. The alert triggers when the
> aggregation value (in my case, the increase) over a time window falls below
> an expected threshold. By utilizing the instance identifier as a label,
> I've streamlined the alert definition to one.
>
> So far, I've been successful in achieving this. However, I'm grappling
> with how to handle instances that have been intentionally shut down. Since
> the metric value for these instances remains static, the alert consistently
> fires.
>
> How can I address this challenge? Did I make a fundamentally flawed
> modeling decision? Any insights would be greatly appreciated.
--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-users/a7a3c938-5a4b-4300-882a-a71214d71d36n%40googlegroups.com.