Is there a metric from which you can determine whether a particular
instance has been "intentionally shut down"? If so, you can use a join
between the metrics in your PromQL alert, e.g.:
expr: increase(foo[5m]) < 1 unless on (instance) adminShutdown == 1
(Aside: this is not a boolean expression. and/or/unless are set
intersection, union and difference operators
<https://prometheus.io/docs/prometheus/latest/querying/operators/#logical-set-binary-operators>.
The LHS is a vector of potential alerts; the RHS is also a vector, filtered
down to only those timeseries where the value is 1; the "unless" operator
suppresses every element of the LHS that has a matching set of labels in
the RHS - in this case matching on only the "instance" label, because
that's what the "on" clause specifies.)
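As a concrete sketch of that matching (metric names and values are made up
for illustration):

```
# LHS: increase(foo[5m]) < 1 returns:
#   {instance="bar"} 0
#   {instance="baz"} 0
# RHS: adminShutdown == 1 returns:
#   adminShutdown{instance="bar"} 1
# LHS unless on (instance) RHS returns:
#   {instance="baz"} 0
```

So only "baz" still alerts: "bar" is suppressed because a series with
instance="bar" exists on the RHS.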
If you don't already have a metric you can use for this, then maybe you
need to create one. This could be done on each target - for example using
the node_exporter textfile collector, drop a file like this into the
collector directory:
adminShutdown 0
Scraping will add the 'instance' label for you.
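A minimal sketch of writing that file from a shutdown script (the directory
path and file name are hypothetical - use whatever you pass to
node_exporter's --collector.textfile.directory). Writing to a temporary
name and renaming means a scrape never sees a half-written file:

```shell
# Stand-in for the real textfile collector directory:
TEXTFILE_DIR="$(mktemp -d)"

# Write the metric to a temp file, then rename it atomically so the
# collector only ever reads a complete .prom file:
printf 'adminShutdown 0\n' > "$TEXTFILE_DIR/admin_shutdown.prom.$$"
mv "$TEXTFILE_DIR/admin_shutdown.prom.$$" "$TEXTFILE_DIR/admin_shutdown.prom"

cat "$TEXTFILE_DIR/admin_shutdown.prom"
```

Flip the value to 1 in your shutdown procedure and back to 0 on startup.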
Or globally - e.g. create a list of metrics describing the state of each
instance, serve it from an HTTP server, and scrape it in its own scrape job
with "honor_labels: true" to prevent the instance labels being overridden:
adminShutdown{instance="foo"} 0
adminShutdown{instance="bar"} 1
adminShutdown{instance="baz"} 0
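A sketch of that scrape job (the job name, host and port are made up -
substitute wherever you actually serve the file):

```yaml
scrape_configs:
  - job_name: instance_state          # hypothetical job name
    honor_labels: true                # keep instance="..." from the scraped body
    static_configs:
      - targets: ['statehost.example:8080']  # hypothetical host serving the file
```

Without honor_labels, Prometheus would rewrite each instance label to the
address of the target it scraped, which is not what you want here.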
Don't worry about the few extra timeseries this will create. Prometheus
compresses timeseries extremely well, especially where scrapes give
repeated identical values.
On Wednesday 6 December 2023 at 10:05:54 UTC Tim B. wrote:
> Hello everyone,
>
> I'm relatively new to Prometheus, so your patience is much appreciated.
>
> I'm facing an issue and seeking guidance:
>
> I'm working with a metric like CPU usage, where instance identifiers are
> submitted as labels. To ensure instances are running as expected, I've
> defined an alert based on this metric. The alert triggers when the
> aggregation value (in my case, the increase) over a time window falls below
> an expected threshold. By utilizing the instance identifier as a label,
> I've streamlined the alert definition to one.
>
> So far, I've been successful in achieving this. However, I'm grappling
> with how to handle instances that have been intentionally shut down. Since
> the metric value for these instances remains static, the alert consistently
> fires.
>
> How can I address this challenge? Did I make a fundamentally flawed
> modeling decision? Any insights would be greatly appreciated.
--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-users/a7a3c938-5a4b-4300-882a-a71214d71d36n%40googlegroups.com.