[prometheus-users] Re: how to make sure a metric is to be checked is "there"

Christoph Anton Mitterer Thu, 27 Apr 2023 19:41:24 -0700

Hey again.

On Wednesday, April 26, 2023 at 9:35:32 AM UTC+2 Brian Candler wrote:


> expr: up{job="myjob"} == 1 unless my_metric

Beware that this will only work if the labels on both 'up' and 
'my_metric' match exactly.  If they don't, then you can either use on(...) 
to specify the set of labels which match, or ignoring(...) to specify the 
ones which don't.

You could start with:

expr: up{job="myjob"} == 1 unless on (instance) my_metric


Ah. I see.
I guess one should use on(...) rather than ignoring(...) because one 
doesn't really know which labels may get added, right?

Also, wouldn't it be better to also consider the "job" label?
   expr: up{job="myjob"} == 1 unless on (instance, job) my_metric
because AFAIU, job is set by Prometheus itself, so if I operate on it as 
well, I can make sure that my_metric is really from the desired job - and 
not perhaps from some other job that wrongly exports a metric of that name.
Does that make sense?
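
To make it concrete, a minimal sketch of the complete rule as I imagine it. 
The group name, alert name, "for" duration, severity label and summary text 
are placeholders of mine, not anything from this thread:

```yaml
groups:
  - name: metric-presence            # hypothetical group name
    rules:
      - alert: MyMetricMissing       # hypothetical alert name
        # Targets of myjob that are up, minus those for which at least one
        # my_metric series with the same instance AND job exists.
        expr: up{job="myjob"} == 1 unless on (instance, job) my_metric
        for: 5m                      # assumed value, to tolerate short gaps
        labels:
          severity: warning          # assumed label
        annotations:
          summary: "my_metric is missing on {{ $labels.instance }}"
```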

 

but I believe this will break if there are multiple instances of my_metric 
for the same host. I'd probably do:

expr: up{job="myjob"} == 1 unless on (instance) count by (instance) (my_metric)


So with job that would be:
   expr: up{job="myjob"} == 1 unless on (instance,job) count by (instance,job) (my_metric)
 
but I don't quite understand why it's needed in the first place?!

If I do the previous:
  expr: up{job="myjob"} == 1 unless on (instance) my_metric
then even if for one given instance value (and optionally one given job 
value) there are multiple results for my_metric (just differing in other 
labels), like:
   
   node_filesystem_free_bytes{device="/dev/vda1",fstype="vfat",mountpoint="/boot/efi"} 5.34147072e+08
   node_filesystem_free_bytes{device="/dev/vda2",fstype="btrfs",mountpoint="/"} 1.2846592e+10
   node_filesystem_free_bytes{device="/dev/vda2",fstype="btrfs",mountpoint="/data/btrfs-top-level-subvolumes/system"} 1.2846592e+10
(all with the same instance/job)

shouldn't the "unless on (instance)" still work? I mean it wouldn't notice 
if only one time series were gone (like e.g. only device="/dev/vda1" 
above), but it should if all were gone?
But the count by would also only notice it if all were gone, because only 
then it gives back no data for the respective instance (and not just 0 as 
value)?
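
To check this, I guess one could write a promtool unit test along these 
lines. It is an untested sketch: the host names, the device labels and the 
sample values are made up. If the plain "unless on (instance)" copes with 
several my_metric series per instance, only host2 (which has no my_metric 
at all) should come back:

```yaml
# Run with: promtool test rules FILE
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # host1 is up and exports my_metric twice (differing only in "device")
      - series: 'up{job="myjob", instance="host1"}'
        values: '1 1 1'
      - series: 'my_metric{job="myjob", instance="host1", device="/dev/vda1"}'
        values: '5 5 5'
      - series: 'my_metric{job="myjob", instance="host1", device="/dev/vda2"}'
        values: '7 7 7'
      # host2 is up but exports no my_metric at all
      - series: 'up{job="myjob", instance="host2"}'
        values: '1 1 1'
    promql_expr_test:
      - expr: up{job="myjob"} == 1 unless on (instance) my_metric
        eval_time: 1m
        exp_samples:
          - labels: 'up{job="myjob", instance="host2"}'
            value: 1
```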


Also, if a scrape does not contain a particular timeseries, but the 
previous scrape *did* contain that timeseries, then the timeseries is 
marked "stale" by storing a staleness marker.


 Is there a way to test for that marker in expressions?

 

So if you do see a value, it means:
- it was in the last scrape
- it was in the last 5 minutes
- there has not been a subsequent scrape where the timeseries was missing


Ah, good to know.
 

> Is this with absent() also needed when I have all my targets/jobs 
> statically configured?

Use absent() when you need to write an expression which you can't do as a 
join against another existing timeseries.


Okay, ... but AFAIU I couldn't use absent() to reproduce the effect of the 
above:
   up{job="myjob"} == 1 unless on (instance) my_metric
because if I'd do something like:
   absent(my_metric)
it would be empty as soon as there was at least one time series for the 
metric.
With that I could really only check for a specific time series to be 
missing like:
   absent(my_metric{instance="somehost",job="node"})
and would have to make one alert with a different expression for e.g. every 
instance.

Or is there any way to use absent() for the general case which I just don't 
see?
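
Just to illustrate what I mean by one alert per instance, a sketch (the 
group name, the alert names and the second host are made up):

```yaml
groups:
  - name: absent-per-instance        # hypothetical group name
    rules:
      # absent() needs the full selector spelled out, so every instance
      # gets its own rule.
      - alert: MyMetricAbsentSomehost
        expr: absent(my_metric{instance="somehost",job="node"})
      - alert: MyMetricAbsentOtherhost
        expr: absent(my_metric{instance="otherhost",job="node"})
```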


If you want to fire when foo exists now but did not exist 5 minutes ago 
(i.e. alert whenever a new metric is created), then

expr: foo unless foo offset 5m


No, I think I'd only want alerts if something vanishes.
 

And yes, it will silence after 5 minutes. You don't want to send recovery 
messages on such alerts.


Sounds reasonable.

I wonder whether the expression is ideal:
The above form would already fire even if the value was missing just once, 
exactly 5m ago.
Wouldn't it be better to do something like:
   expr: foo unless foo offset 15s
   for: 5m
assuming a scrape interval of 15s?

With offset I cannot just specify the "previous" sample, right?

Is it somehow possible to do the above automatically for all metrics 
(and not just foo) from one expression?
And I guess one would again need to link that somehow with `up` to avoid 
useless errors?

 

> How does that work via smartmon?
Sorry, that was my brainfart. It's "storcli.py" that you want.  (Although 
collecting smartmon info is a good idea too).


Ah... I even saw that too, but had totally forgotten that they've renamed 
megacli.


Is there a list of some generally useful alerts, things like:
   up == 0
or like the above idea of checking for metrics that have vanished? Ideally 
with how to use them properly ;-)


 

Thanks,
Chris.

