[prometheus-users] Re: how to make sure a metric is to be checked is "there"

Brian Candler Wed, 26 Apr 2023 00:35:37 -0700

> expr: up{job="myjob"} == 1 unless my_metric

Beware that this will only work if the labels on both 'up' and 
'my_metric' match exactly.  If they don't, then you can either use on(...) 
to specify the set of labels which match, or ignoring(...) to specify the 
ones which don't.


You could start with:

expr: up{job="myjob"} == 1 unless on (instance) my_metric

but I believe this will break if there are multiple instances of my_metric 
for the same host. I'd probably do:

expr: up{job="myjob"} == 1 unless on (instance) count by (instance) (my_metric)
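
As a rough sketch, that expression could sit in a rule file like the one 
below. The group name, alert name, "for" duration, severity label and 
annotation text are all placeholders I've made up, not anything from your 
setup:

```yaml
groups:
  - name: missing-metrics            # hypothetical group name
    rules:
      - alert: MyMetricMissing       # hypothetical alert name
        # Target is up, but no my_metric series exists for that instance.
        expr: up{job="myjob"} == 1 unless on (instance) count by (instance) (my_metric)
        for: 5m                      # assumed; tune to your scrape interval
        labels:
          severity: warning          # assumed label
        annotations:
          summary: "my_metric missing on {{ $labels.instance }}"
```

Since "unless" keeps the labels of the left-hand side, the alert carries 
the labels of 'up', including instance.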

> So my_metric would return "something" as soon as it was contained (in the 
> most recent scrape!)... and if it wasn't, up{job="myjob"} == 1 would 
> silence the "extra" error, in case it is NOT up anyway.

Yes, if up == 0 (i.e. the target is down) then you don't want an additional 
alert saying the metric is missing, as obviously it will be.

> So in that case one should do always both:
> - in general, check for any targets/jobs that are not up
> - in specific (for e.g. very important metrics), additionally check for 
>   the specific metric.
>  Right?

Yes, if there's any chance that the metric could be missing in a "good" 
scrape. This is rarely the case.
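
The general check can be very small. A minimal sketch, where the group 
name, alert name and "for" duration are my own placeholders:

```yaml
groups:
  - name: target-health              # hypothetical group name
    rules:
      - alert: TargetDown            # hypothetical alert name
        # Fires for every scrape target whose last scrape failed.
        expr: up == 0
        for: 5m                      # assumed; avoids alerting on a single failed scrape
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} is down"
```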

You mention MegaCLI: if you're using the node_exporter textfile collector 
scripts to collect information on the RAID card, then you can use the 
timestamp metric I mentioned before to check that the script has run 
recently.  If you forgot to install the script, then sure you won't get any 
metrics.  If you want to alert on this specific bad setup, then obviously 
you'll need a list of targets which *should* have MegaRAID metrics - in 
which case, you might just use this list with your configuration management 
system (e.g. ansible or whatever).
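
If the timestamp you check is the node_textfile_mtime_seconds series that 
node_exporter exposes for each textfile, a freshness check might look like 
the sketch below. The file pattern, the one-hour threshold and the alert 
name are assumptions; adjust them to whatever your script actually writes:

```yaml
groups:
  - name: textfile-freshness         # hypothetical group name
    rules:
      - alert: RaidTextfileStale     # hypothetical alert name
        # The textfile has not been rewritten for more than an hour.
        # ".*storcli.*" is an assumed file name pattern.
        expr: time() - node_textfile_mtime_seconds{file=~".*storcli.*"} > 3600
        annotations:
          summary: "RAID textfile on {{ $labels.instance }} has not been updated recently"
```

Note this only catches a script that has stopped running. If the file was 
never created, there is no series at all and the expression returns nothing.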

> In general, when I get the value of some time series like 
> node_cpu_seconds_total ... when that is missing for e.g. one instance I 
> would get nothing, right? I.e. there is no special value, just the vector 
> of scalar has one element less.

Again, I'd consider it unlikely that a successful scrape from node_exporter 
would silently drop node_cpu_seconds_total metrics.

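If you're talking about the instant vector across all targets, i.e. the 
PromQL expression "node_cpu_seconds_total", then yes: the vector contains 
only the series which currently have a value, so a missing instance simply 
means fewer elements. There is no special "missing" value.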

> But if I do get a value, it's for sure the one from the most recent 
> scrape?!

Yes. Google "prometheus staleness handling".  Basically when you evaluate 
an instant vector query it's done at some time T (by default "now"), and in 
the TSDB it looks for the most recent value of the metric, looking back up 
to 5 minutes (default). Also, if a scrape does not contain a particular 
timeseries, but the previous scrape *did* contain that timeseries, then the 
timeseries is marked "stale" by storing a staleness marker.

So if you do see a value, it means:
- it was in the last scrape
- it was in the last 5 minutes
- there has not been a subsequent scrape where the timeseries was missing
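
If you want to see how old the sample behind a value actually is, the 
timestamp() function returns the time at which each sample was stored. A 
small sketch as a recording rule; the group and rule names are made up:

```yaml
groups:
  - name: sample-age                 # hypothetical group name
    rules:
      # Age in seconds of the most recent my_metric sample, per series.
      - record: my_metric:sample_age_seconds   # hypothetical rule name
        expr: time() - timestamp(my_metric)
```

You can also just paste the expression into the query browser instead of 
recording it.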

> Is this with absent() also needed when I have all my targets/jobs 
> statically configured?

Use absent() when you need to write an expression which you can't do as a 
join against another existing timeseries.
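
For example, if no target of a job is being scraped at all, there is no 
'up' series to join against, and that is where absent() fits. A sketch, 
again with placeholder names:

```yaml
groups:
  - name: job-presence               # hypothetical group name
    rules:
      - alert: JobAbsent             # hypothetical alert name
        # Returns a single element only when no up{job="myjob"} series exists.
        expr: absent(up{job="myjob"})
        for: 10m                     # assumed
        annotations:
          summary: "No targets found for job myjob"
```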

>    expr: foo != foo offset 5m


> That's however a really good idea... and quite simple (AFAIU it should 
> work like that out of the box for all possible instances, right?).
> But that would also fire once at initialisation, and when it then really 
> fires... it would silence again after another 5 min (unless the could 
> changes again), right?

Almost. It won't fire at initialisation, because foo != bar will give no 
results unless foo and bar both exist.

If you want to fire when foo exists now but did not exist 5 minutes ago 
(i.e. alert whenever a new metric is created), then

expr: foo unless foo offset 5m

And yes, it will silence after 5 minutes. You don't want to send recovery 
messages on such alerts. (Personally I don't send recovery messages for 
*any* alerts, but that's a different story:
https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped )
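
For the opposite case, a series which existed 5 minutes ago but is gone 
now, the same trick works with the operands swapped. A sketch, where "foo" 
stands for your metric and the other names are placeholders:

```yaml
groups:
  - name: series-churn               # hypothetical group name
    rules:
      - alert: FooAppeared           # hypothetical alert name
        # Series exists now but did not exist 5 minutes ago.
        expr: foo unless foo offset 5m
      - alert: FooDisappeared        # hypothetical alert name
        # Series existed 5 minutes ago but does not exist now.
        expr: foo offset 5m unless foo
```

I've left out "for:" on purpose: both expressions only return results for 
about 5 minutes, so a "for" of that length or longer would stop them from 
ever firing.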

> How does that work via smartmon?

Sorry, that was my brainfart. It's "storcli.py" that you want.  (Although 
collecting smartmon info is a good idea too).

> OTOH, I would rather want to avoid writing my own exporters just for some 
> RAID checks (=metrics).

Hopefully, scripting with node_exporter textfile collector will do the job 
easily enough.
