On Tuesday, April 25, 2023 at 9:32:25 AM UTC+2 Brian Candler wrote:

I think you would have basically the same problem with Icinga unless you 
have configured Icinga with a list of RAID controllers which should be 
present on a given device, or a list of drives which should be present in a 
particular RAID array.


Well true, you still depend on the RAID tool to actually detect the 
controller and any RAIDs managed by that.

But Icinga would likely catch most real world issues that may happen by 
accident:
- RAID tool not installed
- wrong parameters used when invoking the tool (e.g. a new version
that might have changed command names)
- permission issues (e.g. the tool not being run as root, or broken sudo
rules)
 

I'm not sure if you realise this, but the expression "up == 0" is not a 
boolean, it's a filter.  The metric "up" has many different timeseries, 
each with a different label set, and each with a value.  The PromQL 
expression "up" returns all of those timeseries.  The expression "up == 0" 
filters it down to a subset: just those timeseries where the value is 0.  
Hence this expression could return 0, 1 or more timeseries.  When used as 
an alerting expression, the alert triggers if the expression returns one or 
more timeseries (and regardless of the *value* of those timeseries).  When 
you understand this, then using PromQL for alerting makes much more sense.
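
As a concrete illustration of the filter behaviour described above, a
minimal alerting rule might look like the following sketch. The group name,
alert name, `for` duration and severity label are assumptions chosen for
illustration, not something taken from this thread.

```yaml
groups:
  - name: target-health            # hypothetical group name
    rules:
      - alert: TargetDown          # hypothetical alert name
        # "up == 0" keeps only the series whose value is 0. One alert
        # fires per remaining series, i.e. per job/instance label set.
        expr: up == 0
        for: 5m                    # assumed grace period
        labels:
          severity: critical       # assumed label
        annotations:
          summary: "{{ $labels.job }} on {{ $labels.instance }} could not be scraped"
```

If every target is up, the expression returns no series at all and nothing
fires.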


Well I think that's clear... I have one (scalar) value in up for each 
target I scrape, e.g. if I have just node exporter running, I'd get one 
(scalar) value for the scraped node exporter of every instance.

But the problem is that this does not necessarily tell me if e.g. my raid 
status result was contained in that scraped data, does it?

It depends on the exporter... if I had a separate exporter just for the
RAID metrics, then I'd be fine. But if it's part of a larger one, like node
exporter, it would depend on whether that errors out just because the RAID
data couldn't be determined. And I guess most exporters would by default
just keep working fine if e.g. no RAID tools were installed at all (which
does make sense in a way).

But it would also mean that I wouldn't notice the error if e.g. I forgot
to install the tool.
In Icinga I'd notice this, because I have the configured check per host. If
that runs and doesn't find e.g. MegaCli... it would error out.

Prometheus OTOH knows just about the target (i.e. the host) and the 
exporter (e.g. node)... so it cannot really tell "ah... the RAID tool is 
missing"... unless node exporter had an option that would tell it to insist 
on RAID tool xyz being executed and fail otherwise.
That's basically what I'd like to do manually.


However, if the RAID controller card were to simply vanish, then yes the 
corresponding metrics would vanish - similarly if a drive were to vanish 
from an array, its status would vanish.


Well, but that would usually also go unnoticed in the Icinga setup... but
it's also something that I think never really happens - and if it does, one
probably sees other errors like broken filesystems.

 

You can create alert expressions which check for a specific sentinel metric 
being present with absent(...), and you can do things like joining with the 
'up' metric, so you can say "if any target is being scraped, then alert me 
if that target doesn't return metric X".  It *is* a bit trickier to 
understand than a simple alerting condition, but it can be done.


I guess that sounds like what I'd like to do. Thanks for the pointers below :-)

https://www.robustperception.io/absent-alerting-for-scraped-metrics/

expr: up{job="myjob"} == 1 unless my_metric

So my_metric would return "something" as long as it was contained in the
most recent scrape, and the alert would fire only if it wasn't. The
up{job="myjob"} == 1 part then suppresses that "extra" alert in case the
target is NOT up anyway.

So in that case one should always do both:
- in general, check for any targets/jobs that are not up
- specifically (e.g. for very important metrics), additionally check for
the metric itself.
Right?
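
A minimal sketch of how the expression from that article could sit in a
rule file. The alert name and the 10m `for` duration are assumptions. Note
that `unless` matches series on their full label set by default, so if
my_metric carries labels beyond job and instance (for example a
per-controller label), the match has to be restricted with
on(job, instance), as assumed here.

```yaml
groups:
  - name: metric-presence          # hypothetical group name
    rules:
      - alert: MyMetricMissing     # hypothetical alert name
        # Left side: targets that were scraped successfully.
        # Right side: targets that returned my_metric.
        # "unless" drops every left-hand series that has a match on the
        # right, so only up-but-metric-missing targets remain.
        expr: up{job="myjob"} == 1 unless on(job, instance) my_metric
        for: 10m                   # assumed grace period
        annotations:
          summary: "{{ $labels.instance }} is up but does not expose my_metric"
```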

In general, when I get the value of some time series like
node_cpu_seconds_total ... when that is missing for e.g. one instance I
would get nothing, right? I.e. there is no special value, the resulting
vector just has one element fewer. But if I do get a value, it's for sure
the one from the most recent scrape?!
  

https://www.robustperception.io/absent-alerting-for-jobs/

Is this absent() approach also needed when I have all my targets/jobs
statically configured? I guess not, because Prometheus should know about
them and reflect it in `up` if any of them couldn't be scraped, right?
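
For reference, the approach from that article boils down to something like
the sketch below (group name, alert name and duration are assumptions). It
only fires when no `up` series exists for the job at all, e.g. because the
job currently has no targets. A statically configured target that merely
fails to scrape still has an `up` series with value 0, which an `up == 0`
expression would match instead.

```yaml
groups:
  - name: job-presence             # hypothetical group name
    rules:
      - alert: JobAbsent           # hypothetical alert name
        # absent() returns a single series with value 1 when the selector
        # matches nothing at all, and returns nothing otherwise.
        expr: absent(up{job="myjob"})
        for: 10m                   # assumed grace period
```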

 

As for drives vanishing from an array, you can write expressions using 
count() to check the number of drives.  If you have lots of machines and 
don't want separate rules per controller, then it's possible to use another 
timeseries as a threshold; again, this is a bit more complex:
https://www.robustperception.io/using-time-series-as-alert-thresholds


Thanks, but I guess that scenario (RAID volume suddenly vanishing) is
too unlikely anyway for me to bother. And if it happens... many other
bells and whistles would go off.


But personally I would go really simple, and just create an alert whenever 
the count *changes*.  You can do this using something as simple as:
   expr: foo != foo offset 5m


That's a really good idea, however... and quite simple (AFAIU it should
work like that out of the box for all possible instances, right?).
But that would also fire once at initialisation, and when it then really
fires... it would go silent again after another 5 min (unless the count
changes again), right?

 

(this compares the value of foo now, with the value of foo 5 minutes ago). 
Similarly, you can alert when any given metric vanishes:

    expr: foo offset 5m unless foo


But same here as above, right? It would no longer fire, after another 5m?
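
One way to keep either alert active for longer than 5 minutes is simply to
widen the offset. The following is a sketch with hypothetical group and
alert names and an assumed 1h window, using the placeholder metric foo from
above. After a single change (or disappearance), each rule stays active for
roughly that hour. The trade-off is that a series which has existed for
less than an hour has nothing to be compared against yet.

```yaml
groups:
  - name: foo-changes              # hypothetical group name
    rules:
      - alert: FooChanged          # hypothetical alert name
        # Active while the current value differs from the value 1h ago.
        # Returns nothing for series that have no sample 1h ago.
        expr: foo != foo offset 1h
      - alert: FooVanished         # hypothetical alert name
        # Active while a series that existed 1h ago is missing now.
        expr: foo offset 1h unless foo
```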

 

Do you have some specifics about what types of RAID you want to monitor?


I'll have a few MegaRAID based controllers and other than that mostly HP
Smart Storage Admin CLI (ssacli)... haven't really looked yet for any
exporters for these.

  I've done this for mdraid (using node_exporter) and for MegaRAID, using 
smartmon.py/sh from 
https://github.com/prometheus-community/node-exporter-textfile-collector-scripts


How does that work via smartmon?

 

If using textfile collector scripts, there is a timestamp metric you can 
use to check when your script last wrote the file 
(node_textfile_mtime_seconds), which means it's easy to create an alert to 
check if your script hasn't run recently.
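
A minimal sketch of such a freshness alert. The group and alert names, the
one-hour threshold and the file name pattern are assumptions: the matcher
presumes the script writes a .prom file with "smartmon" in its name, and
the threshold should be somewhat larger than the interval at which the
script is scheduled.

```yaml
groups:
  - name: textfile-freshness       # hypothetical group name
    rules:
      - alert: TextfileStale       # hypothetical alert name
        # node_textfile_mtime_seconds holds the modification time of each
        # textfile as a Unix timestamp; time() is the evaluation time.
        expr: time() - node_textfile_mtime_seconds{file=~".*smartmon.*"} > 3600
        for: 5m                    # assumed grace period
        annotations:
          summary: "{{ $labels.file }} on {{ $labels.instance }} has not been updated for over an hour"
```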


Okay... I guess I'll have to look into that first; I've never done it so
far (I mean using textfile collector scripts).
 

This was all running Prometheus completely standalone though.  If you want 
to feed existing Icinga checks into Prometheus, or Prometheus metrics into 
Icinga, that's a different matter.


For Icinga/Nagios and RAID most people seem to use check_raid [0], which 
seemed a bit unmaintained for some years, though it got a few commits 
recently.

But in principle I'd like to use Prometheus standalone to keep the 
maintenance effort as low as possible.... so if I can do it without any of 
that, I'd be happy.
OTOH, I would rather avoid writing my own exporters just for some
RAID checks (=metrics).


Thanks :-)
Chris.

[0] https://github.com/glensc/nagios-plugin-check_raid
