[prometheus-users] Re: how to make sure a metric is to be checked is "there"

Brian Candler Fri, 28 Apr 2023 00:02:00 -0700

On Friday, 28 April 2023 at 03:41:19 UTC+1 Christoph Anton Mitterer wrote:

You could start with:


expr: up{job="myjob"} == 1 unless on (instance) my_metric


Ah. I see.
I guess one should use on(...) rather than ignoring(...) because one 
doesn't really know which labels may get added, right?


It's a matter of taste. I like to keep things simple, and to keep the rules 
for different metrics similar to each other. You know that 'up' has only 
job and instance labels plus any target labels from service discovery; 
my_metric will have these plus some others which vary between metrics.
 


Also, wouldn't it be better to also consider the "job" label?
   expr: up{job="myjob"} == 1 unless on (instance, job) my_metric


Again a matter of taste, but typically not needed: you already filtered to 
a single job="myjob" on the LHS, and it would be very unusual for the same 
"my_metric" to be received in scrapes from two different jobs (which usually 
means two different exporters) for the same host.

In fact by that logic you might as well simplify it to

expr: up == 1 unless on (instance) my_metric
 


but I believe this will break if there are multiple instances of my_metric 
for the same host. I'd probably do:

expr: up{job="myjob"} == 1 unless on (instance) count by (instance) (my_metric)


So with job that would be:
   expr: up{job="myjob"} == 1 unless on (instance,job) count by (instance,job) (my_metric)


That should be fine.
 

 but I don't quite understand why it's needed in the first place?!


It turns out you're right.

Prometheus provides a web interface where you can test all these 
expressions: even alerting rules are just expressions, which alert if the 
instant vector result is non-empty.

Suppose you wanted to alert if a node *isn't* returning 
node_filesystem_avail_bytes.  You can test it like this:

    up{job="node"} == 1 unless on(instance,job) node_filesystem_avail_bytes

and you were right, "unless" doesn't care if there are multiple matches on 
the right-hand side.
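
As a rough sketch, the same pattern written out as a complete alerting rule 
for the generic "myjob" / my_metric case might look like the following. The 
group name, alert name, "for" duration, severity label and annotation text 
are all placeholders I've made up for illustration; adjust them to your own 
setup.

```yaml
groups:
  - name: missing-metrics            # hypothetical group name
    rules:
      - alert: MyMetricMissing       # hypothetical alert name
        # Returns one series per target of "myjob" that was scraped
        # successfully (up == 1) but has no my_metric series at all.
        # "unless" tolerates several my_metric series per target, so no
        # count by (...) wrapper is needed on the right-hand side.
        expr: |
          up{job="myjob"} == 1
            unless on (instance, job)
          my_metric
        for: 5m                      # assumed value, not a recommendation
        labels:
          severity: warning          # assumed label
        annotations:
          summary: "{{ $labels.instance }} is up but exposes no my_metric"
```

The result carries the labels of the left-hand side (the 'up' series), which 
is why the annotation can refer to the instance label.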

But suppose you wanted to do arithmetic between the metrics (note extra 
parentheses required):

    (up{job="node"} == 1) * on(instance,job) node_filesystem_avail_bytes

This will give you an error, because each instance/job combination on the 
LHS matches multiple filesystems on the RHS. The correct syntax for that is:

    (up{job="node"} == 1) * on(instance,job) group_right() node_filesystem_avail_bytes

which gives N results for each 1:N combination.

This particular example is pointless because the LHS is always 1, so you're 
just multiplying by 1. But even with a static metric like that there are 
cases where you want labels from the LHS to be added to the result, which 
you can do by listing those labels inside the group_right() clause.

https://www.robustperception.io/how-to-have-labels-for-machine-roles/
https://www.robustperception.io/exposing-the-software-version-to-prometheus
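
To illustrate copying a label across, here is a minimal sketch that assumes 
you are running node_exporter, which exposes node_uname_info (value always 
1) with a "nodename" label. The group name and the recorded metric name are 
hypothetical; you can equally paste just the expression into the web 
interface.

```yaml
groups:
  - name: join-example               # hypothetical group name
    rules:
      # Hypothetical recorded name; pick one that fits your own conventions.
      - record: node_filesystem_avail_bytes:with_nodename
        # The LHS is the "one" side and always has value 1, so the
        # multiplication leaves the filesystem values unchanged.
        # group_right(nodename) keeps all labels of the RHS (the "many"
        # side) and copies the nodename label over from the LHS.
        expr: |
          node_uname_info
            * on (instance, job) group_right (nodename)
          node_filesystem_avail_bytes
```

Each filesystem series for a given instance/job then comes back with its 
usual labels plus nodename.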

 


Also, if a scrape does not contain a particular timeseries, but the 
previous scrape *did* contain that timeseries, then the timeseries is 
marked "stale" by storing a staleness marker.


 Is there a way to test for that marker in expressions?


Not usefully, and I don't see why you'd want to. Internally it's stored as 
a special flavour of NaN, but in queries it just looks like the timeseries 
has disappeared - which indeed it has. The marker stops queries from looking 
back in time past it to the previous real scrape value.
 

 Use absent() when you need to write an expression which you can't do as a 
join against another existing timeseries.

Okay, ... but AFAIU I couldn't use absent() to reproduce the effect of the 
above:
   up{job="myjob"} == 1 unless on (instance) my_metric
because if I'd do something like:
   absent(my_metric)
it would be empty as soon as there was at least one time series for the 
metric.
With that I could really only check for a specific time series to be 
missing like:
   absent(my_metric{instance="somehost",job="node"})
and would have to make one alert with a different expression for e.g. every 
instance.


That's exactly what I mean. If you want to look for the absence of a 
*specific* timeseries, you can do it that way. But it's very rare that I've 
had to do that. If you do it with a join on 'up' then it will work for 
multiple similar timeseries.

You would use absent() if you wanted to test for complete absence of 
up{job="myjob"} for example - which would mean that service discovery for 
that job had returned zero targets.
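
A sketch of such a rule, with made-up group and alert names and an arbitrary 
"for" duration:

```yaml
groups:
  - name: discovery-checks           # hypothetical group name
    rules:
      - alert: MyJobHasNoTargets     # hypothetical alert name
        # absent() returns a single series with value 1 only when no
        # series at all matches the selector. The equality matcher
        # job="myjob" is carried over as a label on that result.
        expr: absent(up{job="myjob"})
        for: 10m                     # assumed value
        annotations:
          summary: "No targets found for job {{ $labels.job }}"
```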
 

Wouldn't it be better to do something like.
   expr: foo unless foo offset 15s
   for: 5m
assuming scrape interval of 15s?


Well yes, in reality that might be better; but remember that if a few 
scrapes fail, "foo" will still be present in query results for 5 minutes 
anyway, because of the lookback - so the original expression I gave is not 
as fragile as you might think.
 

 With offset I cannot just specify the "previous" sample, right?


Again, why would you want to?

As I said before, the value of a timeseries at time T is *defined* to be 
the most recent value of the timeseries at or before time T, looking back up 
to 5 minutes (the default lookback delta); so if a few scrapes fail, the 
value persists at the previous value for up to 5 minutes. This is a good 
thing: it helps make the whole ecosystem more robust. You don't want

    expr: foo offset 5m unless foo

to trigger if a single scrape fails. But if the previous scrape was 
successful and did not include that metric, then it will immediately 
vanish, so an expression like the above will trigger immediately.
 
At this point, I think it's best if we leave the discussion here, as it's 
all getting rather theoretical - you clearly have plenty of clue to get 
running with all this, and if you have a specific problem then you can 
raise it here.

Regards,

Brian.

