On Friday, 28 April 2023 at 03:41:19 UTC+1 Christoph Anton Mitterer wrote:
You could start with:
expr: up{job="myjob"} == 1 unless on (instance) my_metric
Ah. I see.
I guess one should use on(...) rather than ignoring(...) because one
doesn't really know which labels may get added, right?
It's a matter of taste. I like to keep things simple, and to keep the rules
for different metrics similar to each other. You know that 'up' has only
job and instance labels plus any target labels from service discovery;
my_metric will have these plus some others which vary between metrics.
Also, wouldn't it be better to also consider the "job" label?
expr: up{job="myjob"} == 1 unless on (instance, job) my_metric
Again a matter of taste, but typically not needed: you already filtered to
a single job="myjob" on the LHS, and it would be very unusual for the same
my_metric to be received in scrapes from two different jobs (which usually
means two different exporters) for the same host.
In fact by that logic you might as well simplify it to
expr: up == 1 unless on (instance) my_metric
but I believe this will break if there are multiple instances of my_metric
for the same host. I'd probably do:
expr: up{job="myjob"} == 1 unless on (instance) count by (instance) (my_metric)
So with job that would be:
expr: up{job="myjob"} == 1 unless on (instance,job) count by (instance,job) (my_metric)
That should be fine.
but I don't quite understand why it's needed in the first place?!
It turns out you're right.
Prometheus provides a web interface where you can test all these
expressions: even alerting rules are just expressions, which alert if the
instant vector result is non-empty.
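To make that concrete, here is a minimal sketch of a rule file that wraps the expression discussed earlier in this thread. The file name, group name, alert name, "for" duration, severity label and annotation text are all placeholders I made up for illustration; only the expr comes from the discussion.

```yaml
# rules.yml (hypothetical file name)
groups:
  - name: myjob_missing_metrics        # hypothetical group name
    rules:
      - alert: MyMetricMissing         # hypothetical alert name
        # The expression returns one element per target that is up but has
        # no my_metric series. Any non-empty result makes the alert pending.
        expr: up{job="myjob"} == 1 unless on (instance, job) my_metric
        for: 5m                        # assumed; pick what suits your scrape interval
        labels:
          severity: warning            # assumed label, not required by Prometheus
        annotations:
          summary: "{{ $labels.instance }} is up but is not exporting my_metric"
```

Pasting just the expr into the web interface shows exactly the set of series that would raise alerts at that moment.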
Suppose you wanted to alert if a node *isn't* returning
node_filesystem_avail_bytes. You can test it like this:
up{job="node"} == 1 unless on(instance,job) node_filesystem_avail_bytes
and you were right, "unless" doesn't care if there are multiple matches on
the right-hand side.
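If you would rather check this offline than against live data, a promtool unit test along the following lines should show the same thing. This is only a sketch: the instance names, mountpoints and sample values are invented, and I am assuming a test file with no rule_files entry is accepted since only a raw expression is being evaluated.

```yaml
# unless_test.yml (hypothetical file name); run with: promtool test rules unless_test.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Two targets, both up.
      - series: 'up{job="node", instance="a:9100"}'
        values: '1 1 1'
      - series: 'up{job="node", instance="b:9100"}'
        values: '1 1 1'
      # Only target "a" exports the metric, and it does so for two filesystems,
      # so one LHS element matches two RHS elements.
      - series: 'node_filesystem_avail_bytes{job="node", instance="a:9100", mountpoint="/"}'
        values: '100 100 100'
      - series: 'node_filesystem_avail_bytes{job="node", instance="a:9100", mountpoint="/home"}'
        values: '200 200 200'
    promql_expr_test:
      - expr: up{job="node"} == 1 unless on(instance,job) node_filesystem_avail_bytes
        eval_time: 2m
        exp_samples:
          # "a" is removed despite the multiple matches; only "b" remains.
          - labels: 'up{job="node", instance="b:9100"}'
            value: 1
```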
But suppose you wanted to do arithmetic between the metrics (note extra
parentheses required):
(up{job="node"} == 1) * on(instance,job) node_filesystem_avail_bytes
This will give you an error, because each instance/job combination on the
LHS matches multiple filesystems on the RHS. The correct syntax for that is:
(up{job="node"} == 1) * on(instance,job) group_right() node_filesystem_avail_bytes
which gives N results for each 1:N combination.
This particular example is pointless because the LHS is always 1, so you're
just multiplying by 1. But even with a static metric like that there are
cases where you want labels from the LHS to be added to the result, which
you can do by listing those labels inside the group_right() clause.
https://www.robustperception.io/how-to-have-labels-for-machine-roles/
https://www.robustperception.io/exposing-the-software-version-to-prometheus
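As a rough sketch of that label-copying case, written as a recording rule: machine_role here is an assumed info-style metric with value 1 and a "role" label per instance, in the spirit of the first article above, and the group and record names are hypothetical.

```yaml
groups:
  - name: join_examples                # hypothetical group name
    rules:
      # Assumed input: machine_role{instance="...", role="..."} 1
      # There is one machine_role series per instance on the LHS and many
      # filesystems per instance on the RHS, hence group_right.
      # Listing "role" copies that label from the LHS onto each result.
      - record: node_filesystem_avail_bytes:with_role   # hypothetical record name
        expr: machine_role * on (instance) group_right (role) node_filesystem_avail_bytes
```

Because the LHS value is 1, the result values are the unchanged filesystem values, now carrying the extra role label. Instances with no machine_role series drop out of the result.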
Also, if a scrape does not contain a particular timeseries, but the
previous scrape *did* contain that timeseries, then the timeseries is
marked "stale" by storing a staleness marker.
Is there a way to test for that marker in expressions?
Not usefully, and I don't see why you'd want to. Internally it's stored as
a special flavour of NaN, but in queries it just looks like the timeseries
has disappeared - which indeed it has. It stops you looking back in time to
the previous real scrape value.
Use absent() when you need to write an expression which you can't do as a
join against another existing timeseries.
Okay, ... but AFAIU I couldn't use absent() to reproduce the effect of the
above:
up{job="myjob"} == 1 unless on (instance) my_metric
because if I'd do something like:
absent(my_metric)
it would be empty as soon as there was at least one time series for the
metric.
With that I could really only check for a specific time series to be
missing like:
absent(my_metric{instance="somehost",job="node"})
and would have to make one alert with a different expression for e.g. every
instance.
That's exactly what I mean. If you want to look for the absence of a
*specific* timeseries, you can do it that way. But it's very rare that I've
had to do that. If you do it with a join on 'up' then it will work for
multiple similar timeseries.
You would use absent() if you wanted to test for complete absence of
up{job="myjob"} for example - which would mean that service discovery for
that job had returned zero targets.
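A minimal sketch of such a rule, with made-up group name, alert name, "for" duration and annotation text:

```yaml
groups:
  - name: myjob_discovery              # hypothetical group name
    rules:
      - alert: MyJobHasNoTargets       # hypothetical alert name
        # absent() returns a single element with value 1 when the selector
        # matches nothing at all, carrying the labels from the equality
        # matchers (here job="myjob"); otherwise it returns nothing.
        expr: absent(up{job="myjob"})
        for: 10m                       # assumed duration
        annotations:
          summary: "No targets found for job {{ $labels.job }}"
```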
Wouldn't it be better to do something like:
expr: foo unless foo offset 15s
for: 5m
assuming scrape interval of 15s?
Well yes, in reality that might be better; but remember that if a few
scrapes fail, "foo" will still be present in query results for 5 minutes
anyway, because of the lookback - so the original expression I gave is not
as fragile as you might think.
With offset I cannot just specify the "previous" sample, right?
Again, why would you want to?
As I said before, the value of a timeseries at time T is *defined* to be
the most recent value of the timeseries at or before time T, up to 5
minutes previously; so if a few scrapes fail, then the value is defined to
persist at the previous value for 5 minutes. This is a good thing: it helps
make the whole ecosystem more robust. You don't want
expr: foo offset 5m unless foo
to trigger if a single scrape fails. But if the previous scrape was
successful and did not include that metric, then it will immediately
vanish, so an expression like the above will trigger immediately.
At this point, I think it's best if we leave the discussion here, as it's
all getting rather theoretical - you clearly have plenty of clue to get
running with all this, and if you have a specific problem then you can
raise it here.
Regards,
Brian.