I think you would have basically the same problem with Icinga unless you have configured Icinga with a list of RAID controllers which should be present on a given device, or a list of drives which should be present in a particular RAID array.
> I mean I can of course check e.g.
> expr: up == 0
> in some alert.
> But AFAIU this actually just tells me whether there are any scrape
> targets that couldn't be scraped (in the last run, based on the scrape
> interval), right?

There is a separate 'up' metric for each individual target that is being scraped, so it's not just "any" target that failed - you can see exactly *which* target(s) failed. If the exporter on a particular target goes wrong internally, it should return an error (such as a 500 HTTP response), which would cause its corresponding 'up' metric to go to 0.

Note also that the expression "up == 0" is not a boolean - it's a filter. The metric "up" has many different timeseries, each with a different label set and each with a value. The PromQL expression "up" returns all of those timeseries. The expression "up == 0" filters them down to a subset: just those timeseries whose value is 0. Hence this expression can return 0, 1 or more timeseries. When used as an alerting expression, the alert fires if the expression returns one or more timeseries (regardless of the *value* of those timeseries). Once you understand this, using PromQL for alerting makes much more sense.

However, if the RAID controller card were simply to vanish, then yes, the corresponding metrics would vanish - and similarly, if a drive vanished from an array, its status metric would vanish. You can create alert expressions which check for a specific sentinel metric being present with absent(...), and you can do things like joining with the 'up' metric, so you can say "if any target is being scraped, then alert me if that target doesn't return metric X". It *is* a bit trickier to understand than a simple alerting condition, but it can be done.
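For example, a sketch of the "join with up" approach (assuming a node_exporter job named "node" - adapt the job and metric names to your setup):

    - alert: MdRaidMetricsMissing
      expr: up{job="node"} == 1 unless on (instance) node_md_disks
      for: 15m
      annotations:
        summary: "{{ $labels.instance }} is up but exposes no node_md_disks"

This fires for any instance that is being scraped successfully but no longer returns the node_md_disks metric - which covers the "controller vanished" case without having to hard-code a list of hosts in the rule itself.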
https://www.robustperception.io/absent-alerting-for-scraped-metrics/
https://www.robustperception.io/existential-issues-with-metrics/
https://www.robustperception.io/absent-alerting-for-jobs/

As for drives vanishing from an array, you can write expressions using count() to check the number of drives. If you have lots of machines and don't want separate rules per controller, it's possible to use another timeseries as a threshold, though again this is a bit more complex:

https://www.robustperception.io/using-time-series-as-alert-thresholds

But personally I would go really simple, and just create an alert whenever the count *changes*. You can do this with something as simple as:

    expr: foo != foo offset 5m

(this compares the value of foo now with the value of foo 5 minutes ago). Similarly, you can alert when any given metric vanishes:

    expr: foo offset 5m unless foo

Those sorts of simple alerts have great value.

Do you have some specifics about what types of RAID you want to monitor? I've done this for mdraid (using node_exporter) and for MegaRAID, using smartmon.py/sh from
https://github.com/prometheus-community/node-exporter-textfile-collector-scripts

If you're using textfile collector scripts, there is a timestamp metric (node_textfile_mtime_seconds) you can use to check when your script last wrote the file, which makes it easy to create an alert that fires if your script hasn't run recently.

This was all running Prometheus completely standalone, though. If you want to feed existing Icinga checks into Prometheus, or Prometheus metrics into Icinga, that's a different matter.

HTH,

Brian.

On Tuesday, 25 April 2023 at 03:27:21 UTC+1 Christoph Anton Mitterer wrote:
> Hey there.
>
> What I'm trying to do is basically replace Icinga with Prometheus (or
> well not really replacing, but integrating it into the latter, which I
> anyway need for other purposes).
>
> So I'll have e.g. some metric that shows me the RAID status on
> instances, and I want to get an alert, when a HDD is broken.
>
> I guess it's obvious that it could turn out bad if I don't get an
> alert, just because the metric data isn't there (for some reason).
>
> In Icinga, this would have been simple:
> the system knows about every host and every service it needs to check.
> If there's no result (like RAID is OK or FAILED) anymore (e.g. because
> the RAID CLI tool is not installed), the check's status would at least
> go into UNKNOWN.
>
> I wonder how this is / can be handled in Prometheus?
>
> I mean I can of course check e.g.
> expr: up == 0
> in some alert.
> But AFAIU this actually just tells me whether there are any scrape
> targets that couldn't be scraped (in the last run, based on the scrape
> interval), right?
>
> If my important checks were all their own exporters, e.g. one exporter
> just for the RAID status, then - AFAIU - this would already work and
> notify me for sure, even if there's no result at all.
>
> But what if it's part of some larger exporter, like e.g. the mdadm data
> in node_exporter?
>
> up wouldn't become 0 just because node_md_disks was not part of
> the metrics.
>
> Even if I'd say it's the duty of the exporter to make sure that there
> is a result even on failure to read the status... what e.g. if some
> tool is already needed just to determine whether that metric makes
> sense to be collected at all?
> That would be typical for most hardware RAID controllers... you need
> the respective RAID tool just to see whether any RAIDs are present.
>
> So in principle I'd like a simple way to check, for a certain group of
> hosts, the availability of a certain time series, so that I can set up
> e.g. an alert that fires if any node where I have e.g. some MegaCLI-
> based RAID lacks megacli_some_metric.
>
> Or is there some other/better way this is done in practice?
>
> Thanks,
> Chris.
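P.S. To make the count-change and textfile-staleness ideas above concrete, here is a rough sketch as alerting rules (node_md_disks and node_textfile_mtime_seconds are real node_exporter metrics; the offset and staleness threshold are illustrative - tune them to your scrape and cron intervals):

    groups:
      - name: raid
        rules:
          - alert: RaidDiskCountChanged
            expr: node_md_disks != node_md_disks offset 1h
            for: 5m
            annotations:
              summary: "Disk count changed on {{ $labels.instance }} ({{ $labels.device }})"
          - alert: TextfileCollectorStale
            expr: time() - node_textfile_mtime_seconds > 3600
            annotations:
              summary: "Textfile collector output on {{ $labels.instance }} is stale"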

