Hey again.
On Wednesday, April 26, 2023 at 9:35:32 AM UTC+2 Brian Candler wrote:
> expr: up{job="myjob"} == 1 unless my_metric
Beware that this will only work if the labels on both 'up' and
'my_metric' match exactly. If they don't, then you can either use on(...)
to specify the set of labels which match, or ignoring(...) to specify the
ones which don't.
You could start with:
expr: up{job="myjob"} == 1 unless on (instance) my_metric
Ah. I see.
I guess one should use on(...) rather than ignoring(...) because one
doesn't really know which labels may get added, right?
Also, wouldn't it be better to also consider the "job" label?
expr: up{job="myjob"} == 1 unless on (instance, job) my_metric
because AFAIU, job is set by Prometheus itself, so if I operate on it as
well, I can make sure that my_metric is really from the desired job - and
not perhaps from some other job that wrongly exports a metric of that name.
Does that make sense?
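
To make the idea concrete, here is a minimal sketch of how that expression might sit in a rule file. The group name, alert name, `for` duration, severity label and annotation text are all made up for illustration; only the `expr` comes from the discussion above.

```yaml
groups:
  - name: myjob-metric-presence      # hypothetical group name
    rules:
      - alert: MyMetricMissing       # hypothetical alert name
        # Targets of job "myjob" that are up but expose no my_metric series
        # carrying the same instance and job labels.
        expr: up{job="myjob"} == 1 unless on (instance, job) my_metric
        for: 5m                      # assumed grace period
        labels:
          severity: warning          # hypothetical label
        annotations:
          summary: "my_metric missing on {{ $labels.instance }}"
```

Matching on both instance and job means a my_metric series exported under a different job would not suppress the alert for "myjob".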
but I believe this will break if there are multiple instances of my_metric
for the same host. I'd probably do:
expr: up{job="myjob"} == 1 unless on (instance) count by (instance)
(my_metric)
So with job that would be:
expr: up{job="myjob"} == 1 unless on (instance,job) count by
(instance,job) (my_metric)
but I don't quite understand why it's needed in the first place?!
If I do the previous:
expr: up{job="myjob"} == 1 unless on (instance) my_metric
then even if for one given instance value (and optionally one given job
value) there are multiple results for my_metric (just differing in other
labels), like:
node_filesystem_free_bytes{device="/dev/vda1",fstype="vfat",mountpoint="/boot/efi"}
5.34147072e+08
node_filesystem_free_bytes{device="/dev/vda2",fstype="btrfs",mountpoint="/"}
1.2846592e+10
node_filesystem_free_bytes{device="/dev/vda2",fstype="btrfs",mountpoint="/data/btrfs-top-level-subvolumes/system"}
1.2846592e+10
(all with the same instance/job)
shouldn't the "unless on (instance)" still work? I mean it wouldn't notice
if only one time series were gone (like e.g. only device="/dev/vda1"
above), but it should if all were gone?
But the count by would also only notice it if all were gone, because only
then it gives back no data for the respective instance (and not just 0 as
value)?
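
One way to check this without guessing would be a promtool unit test. The sketch below is untested here; the host names, mountpoints and sample values are invented. It relies on my understanding that and/or/unless are set operators that match many-to-many, so several my_metric series per instance should not cause a matching error.

```yaml
# test.yml, hypothetical file name; run with: promtool test rules test.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="myjob", instance="host1"}'
        values: '1 1 1'
      - series: 'up{job="myjob", instance="host2"}'
        values: '1 1 1'
      # host1 exposes two my_metric series differing only in mountpoint
      - series: 'my_metric{job="myjob", instance="host1", mountpoint="/"}'
        values: '1 1 1'
      - series: 'my_metric{job="myjob", instance="host1", mountpoint="/boot/efi"}'
        values: '1 1 1'
      # host2 exposes no my_metric at all
    promql_expr_test:
      - expr: up{job="myjob"} == 1 unless on (instance) my_metric
        eval_time: 1m
        exp_samples:
          # only host2 should remain, since host1 has matching my_metric series
          - labels: 'up{job="myjob", instance="host2"}'
            value: 1
```

If that passes, the plain "unless on (instance)" form and the "count by (instance)" form would behave the same for this purpose: both only react once every my_metric series of an instance is gone.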
Also, if a scrape does not contain a particular timeseries, but the
previous scrape *did* contain that timeseries, then the timeseries is
marked "stale" by storing a staleness marker.
Is there a way to test for that marker in expressions?
So if you do see a value, it means:
- it was in the last scrape
- it was in the last 5 minutes
- there has not been a subsequent scrape where the timeseries was missing
Ah, good to know.
> Is this with absent() also needed when I have all my targets/jobs
> statically configured?
Use absent() when you need to write an expression which you can't do as a
join against another existing timeseries.
Okay, ... but AFAIU I couldn't use absent() to reproduce the effect of the
above:
up{job="myjob"} == 1 unless on (instance) my_metric
because if I'd do something like:
absent(my_metric)
it would be empty as soon as there was at least one time series for the
metric.
With that I could really only check for a specific time series to be
missing like:
absent(my_metric{instance="somehost",job="node"})
and would have to make one alert with a different expression for e.g. every
instance.
Or is there any way to use absent() for the general case which I just don't
see?
If you want to fire when foo exists now but did not exist 5 minutes ago
(i.e. alert whenever a new metric is created), then
expr: foo unless foo offset 5m
No I think I'd only want alerts if something vanishes.
And yes, it will silence after 5 minutes. You don't want to send recovery
messages on such alerts.
Sounds reasonable.
I wonder whether the expression is ideal:
The above form would already fire even if the value was missing just once,
exactly 5m ago.
Wouldn't it be better to do something like:
expr: foo unless foo offset 15s
for: 5m
assuming scrape interval of 15s?
With offset I cannot just specify the "previous" sample, right?
Is it somehow possible to do the above like automatically for all metrics
(and not just foo) from one expression?
And I guess one would again need to link that somehow with `up` to avoid
useless errors?
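
For the "vanished" direction, swapping the two sides and joining against `up` might look like the sketch below. It assumes foo carries instance and job labels; the group name, alert name, severity label and annotation text are hypothetical.

```yaml
groups:
  - name: vanished-metrics-example   # hypothetical group name
    rules:
      - alert: MetricVanished        # hypothetical alert name
        # Series of foo that existed 5m ago but are gone now, restricted to
        # targets that are still being scraped successfully.
        expr: |
          (foo offset 5m unless foo)
            and on (instance, job) (up == 1)
        labels:
          severity: warning          # hypothetical label
        annotations:
          summary: "foo disappeared from {{ $labels.instance }}"
```

There is deliberately no `for:` here. As far as I can tell, the expression is only true for roughly as long as the offset window (about 5m after the series vanishes), so a `for:` as long as or longer than the offset would probably keep the alert from ever firing. The same reasoning would apply to `foo unless foo offset 15s` combined with `for: 5m`.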
> How does that work via smartmon?
Sorry, that was my brainfart. It's "storcli.py" that you want. (Although
collecting smartmon info is a good idea too).
Ah... I even saw that too, but had totally forgotten that they've renamed
megacli.
Is there a list of some generally useful alerts, things like:
up == 0
or like the above idea of checking for metrics that have vanished? Ideally
with how to use them properly ;-)
Thanks,
Chris.
--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-users/feb4c4eb-0b5c-4dcf-bc24-25fc4a90ef42n%40googlegroups.com.