I think you would have basically the same problem with Icinga unless you have configured Icinga with a list of RAID controllers which should be present on a given device, or a list of drives which should be present in a particular RAID array.
> I mean I can of course check e.g.
> expr: up == 0
> in some alert.
> But AFAIU this actually just tells me whether there are any scrape
> targets that couldn't be scraped (in the last run, based on the scrape
> interval), right?

There is a separate 'up' metric for each individual target that is being scraped, so it's not just "any" target that failed - you can see exactly *which* target(s) failed. If the exporter on a particular target goes wrong internally, it should return an error (such as a 500 HTTP response), which would cause its corresponding 'up' metric to go to 0.

Note also that the expression "up == 0" is not a boolean - it's a filter. The metric "up" has many different timeseries, each with a different label set and each with a value. The PromQL expression "up" returns all of those timeseries. The expression "up == 0" filters them down to a subset: just those timeseries whose value is 0. Hence this expression can return 0, 1 or more timeseries. When used as an alerting expression, the alert fires if the expression returns one or more timeseries (regardless of the *value* of those timeseries). Once you understand this, using PromQL for alerting makes much more sense.

However, if the RAID controller card were simply to vanish, then yes, the corresponding metrics would vanish - and similarly, if a drive vanished from an array, its status metric would vanish. You can create alert expressions which check for a specific sentinel metric being present with absent(...), and you can do things like joining with the 'up' metric, so you can say "if any target is being scraped, then alert me if that target doesn't return metric X". It *is* a bit trickier to understand than a simple alerting condition, but it can be done.
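For example, a sketch of the "join with up" approach (assuming a node_exporter job named "node" - adapt the job and metric names to your setup):

    - alert: MdRaidMetricsMissing
      expr: up{job="node"} == 1 unless on (instance) node_md_disks
      for: 15m
      annotations:
        summary: "{{ $labels.instance }} is up but exposes no node_md_disks"

This fires for any instance that is being scraped successfully but no longer returns the node_md_disks metric - which covers the "controller vanished" case without having to hard-code a list of hosts in the rule itself.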
https://www.robustperception.io/absent-alerting-for-scraped-metrics/
https://www.robustperception.io/existential-issues-with-metrics/
https://www.robustperception.io/absent-alerting-for-jobs/

As for drives vanishing from an array, you can write expressions using count() to check the number of drives. If you have lots of machines and don't want separate rules per controller, it's possible to use another timeseries as a threshold, though again this is a bit more complex:

https://www.robustperception.io/using-time-series-as-alert-thresholds

But personally I would go really simple, and just create an alert whenever the count *changes*. You can do this with something as simple as:

    expr: foo != foo offset 5m

(this compares the value of foo now with the value of foo 5 minutes ago). Similarly, you can alert when any given metric vanishes:

    expr: foo offset 5m unless foo

Those sorts of simple alerts have great value.

Do you have some specifics about what types of RAID you want to monitor? I've done this for mdraid (using node_exporter) and for MegaRAID, using smartmon.py/sh from
https://github.com/prometheus-community/node-exporter-textfile-collector-scripts

If you're using textfile collector scripts, there is a timestamp metric (node_textfile_mtime_seconds) you can use to check when your script last wrote the file, which makes it easy to create an alert that fires if your script hasn't run recently.

This was all running Prometheus completely standalone, though. If you want to feed existing Icinga checks into Prometheus, or Prometheus metrics into Icinga, that's a different matter.

HTH,

Brian.

On Tuesday, 25 April 2023 at 03:27:21 UTC+1 Christoph Anton Mitterer wrote:
> Hey there.
>
> What I'm trying to do is basically replace Icinga with Prometheus (or
> well not really replacing, but integrating it into the latter, which I
> anyway need for other purposes).
>
> So I'll have e.g. some metric that shows me the RAID status on
> instances, and I want to get an alert, when a HDD is broken.
>
> I guess it's obvious that it could turn out bad if I don't get an
> alert, just because the metric data isn't there (for some reason).
>
> In Icinga, this would have been simple:
> the system knows about every host and every service it needs to check.
> If there's no result (like RAID is OK or FAILED) anymore (e.g. because
> the RAID CLI tool is not installed), the check's status would at least
> go into UNKNOWN.
>
> I wonder how this is / can be handled in Prometheus?
>
> I mean I can of course check e.g.
> expr: up == 0
> in some alert.
> But AFAIU this actually just tells me whether there are any scrape
> targets that couldn't be scraped (in the last run, based on the scrape
> interval), right?
>
> If my important checks were all their own exporters, e.g. one exporter
> just for the RAID status, then - AFAIU - this would already work and
> notify me for sure, even if there's no result at all.
>
> But what if it's part of some larger exporter, like e.g. the mdadm data
> in node_exporter?
>
> up wouldn't become 0 just because node_md_disks was not part of
> the metrics.
>
> Even if I'd say it's the duty of the exporter to make sure that there
> is a result even on failure to read the status... what e.g. if some
> tool is already needed just to determine whether that metric makes
> sense to be collected at all?
> That would be typical for most hardware RAID controllers... you need
> the respective RAID tool just to see whether any RAIDs are present.
>
> So in principle I'd like a simple way to check, for a certain group of
> hosts, the availability of a certain time series, so that I can set up
> e.g. an alert that fires if any node where I have e.g. some MegaCLI-
> based RAID lacks megacli_some_metric.
>
> Or is there some other/better way this is done in practice?
>
> Thanks,
> Chris.
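P.S. To make the count-change and textfile-staleness ideas above concrete, here is a rough sketch as alerting rules (node_md_disks and node_textfile_mtime_seconds are real node_exporter metrics; the offset and staleness threshold are illustrative - tune them to your scrape and cron intervals):

    groups:
      - name: raid
        rules:
          - alert: RaidDiskCountChanged
            expr: node_md_disks != node_md_disks offset 1h
            for: 5m
            annotations:
              summary: "Disk count changed on {{ $labels.instance }} ({{ $labels.device }})"
          - alert: TextfileCollectorStale
            expr: time() - node_textfile_mtime_seconds > 3600
            annotations:
              summary: "Textfile collector output on {{ $labels.instance }} is stale"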

