P.S. Your expression

>    expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRighHosts"} * 100)
>          / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRighHosts"}) >= 85

can be simplified to:

>    expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRighHosts"} * 100)
>          / node_filesystem_size_bytes) >= 85

That's because the result instant vector for an expression like "foo / bar" 
only includes entries where the label sets match on the left- and right-hand 
sides; any others are dropped silently.  (This form may be slightly less 
efficient, but I wouldn't expect it to be a problem unless you have 
hundreds of thousands of filesystems.)

I would be inclined to simplify it further to:

>    expr: node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRighHosts"}
>          / node_filesystem_size_bytes < 0.15

You can use {{ $value | humanizePercentage }} in your alert annotations to 
show readable percentages.
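
For example, a sketch of a complete rule (the group and alert names here 
are placeholders I've made up):

```yaml
groups:
  - name: node
    rules:
      - alert: RootFilesystemLow
        expr: |
          node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRighHosts"}
            / node_filesystem_size_bytes < 0.15
        for: 10m
        annotations:
          # $value is the ratio from the expr, rendered as e.g. "12.3%"
          summary: 'Only {{ $value | humanizePercentage }} free on root filesystem of {{ $labels.instance }}'
```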

On Wednesday, 26 April 2023 at 08:14:35 UTC+1 Brian Candler wrote:

> > I guess with (2) you also meant having a route which is then permanently 
> muted?
>
> I'd use a route with a null receiver (i.e. a receiver which has no 
> <transport>_configs under it)
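>
> For example, a fragment of alertmanager.yml (the matcher label is a 
> made-up example):
>
> ```yaml
> receivers:
>   - name: 'null'   # no <transport>_configs, so matching alerts go nowhere
> route:
>   routes:
>     - matchers:
>         - owner = "foreign"   # hypothetical label on the alerts
>       receiver: 'null'
> ```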
>
> > b) The idea that I had above:
> > - using <alert_relabel_configs> to filter on the instances and add a 
> label if it should be silenced
> > - use only that label in the expr instead of the full regex
> > But would that even work?
>
> No, because as far as I know alert_relabel_configs is done *after* the 
> alert is generated from the alerting rule. It's only used to add extra 
> labels before sending the generated alert to alertmanager. (It occurs to me 
> that it *might* be possible to use 'drop' rules here to discard alerts; 
> that would be a very confusing config IMO)
>
> > For me it's really like this:
> > My Prometheus instance monitors:
> > - my "own" instances, where I need to react on things like >85% usage on 
> root filesystem (and thus want to get an alert)
> > - "foreign" instances, where I just get the node exporter data and show 
> e.g. CPU usage, IO usage, and so on as a convenience to users of our 
> cluster - but any alert conditions wouldn't cause any further action on my 
> side (and the guys in charge of those servers have their own monitoring)
>
> In this situation, and if you are using static_configs or file_sd_configs 
> to identify the hosts, then I would simply use a target label (e.g. 
> "owner") to distinguish which targets are yours and which are foreign; or I 
> would use two different scrape jobs for self and foreign (which means the 
> "job" label can be used to distinguish them)
>
> The storage cost of having extra labels in the TSDB is essentially zero, 
> because it's the unique combination of labels that identifies the 
> timeseries - the bag of labels is mapped to an integer ID I believe.  So 
> the only problem is if this label changes often, and to me it sounds like a 
> 'local' or 'foreign' instance remains this way indefinitely.
>
> If you really want to keep these labels out of the metrics, then having a 
> separate timeseries with metadata for each instance is the next-best 
> option. Suppose you have a bunch of metrics with an 'instance' label, e.g.
>
> node_filesystem_free_bytes{instance="bar", ....}
> node_filesystem_size_bytes{instance="bar", ....}
> ...
>
> as the actual metrics you're monitoring, then you create one extra static 
> timeseries per host (instance) like this:
>
> meta{instance="bar",owner="self",site="london"} 1
>
> (aside: TSDB storage for this will be almost zero, because of the 
> delta-encoding used). These can be created by scraping a static webserver, 
> or by using recording rules.
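>
> For example, the static file you serve could be as simple as (instance 
> and site values invented):
>
> ```
> meta{instance="bar",owner="self",site="london"} 1
> meta{instance="baz",owner="foreign",site="paris"} 1
> ```
>
> and you scrape it like any other target.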
>
> Then your alerting rules can be like this:
>
> expr: |
>   (
>      ... normal rule here ...
>   ) * on(instance) group_left(site) meta{owner="self"}
>
> The join will:
> * Limit alerting to those hosts which have a corresponding 'meta' 
> timeseries (matched on 'instance') and which have the label owner="self"
> * Add the "site" label to the generated alerts
>
> Beware that:
>
> 1. this will suppress alerts for any host which does not have a 
> corresponding 'meta' timeseries. It's possible to work around this so 
> that the default is to send rather than suppress alerts, but it makes 
> the expressions more complex:
> https://www.robustperception.io/left-joins-in-promql
>
> 2. the "instance" labels must match exactly. So for example, if you're 
> currently scraping with the default label instance="foo:9100" then you'll 
> need to change this to instance="foo" (which is good practice anyway). See
> https://www.robustperception.io/controlling-the-instance-label
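>
> For example, a sketch which strips an optional port from the scraped 
> address (not necessarily the exact trick referred to below):
>
> ```yaml
> relabel_configs:
>   - source_labels: [__address__]
>     regex: '([^:]+)(?::\d+)?'   # capture hostname, drop trailing :port
>     target_label: instance      # replacement defaults to '$1'
> ```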
>
> (I use some relabel_configs tricks for this; examples posted in this group 
> previously)
>
> > From all that it seems to me that the "best" solution is either:
> > a) simply making more complex and error prone alert rules, that filter 
> out the instances in the first place, like in:
> >    expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRighHosts"} * 100)
> >          / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRighHosts"}) >= 85
>
> That's not great, because as you observe it will become more and more 
> complex over time; and in any case won't work if you want to treat certain 
> combinations of labels differently (e.g. stop alerting on a specific 
> *filesystem* on a specific host)
>
> If you really don't want to use either of the solutions I've given above, 
> then another way is to write some code to preprocess your alerting rules, 
> i.e. expand a single template rule into a bunch of separate rules, based on 
> your own templates and data sources.
>
> HTH,
>
> Brian.
>
>
