Re: [prometheus-users] probe_success VS up

'Brian Candler' via Prometheus Users Tue, 28 Nov 2023 02:18:03 -0800

On Tuesday, 28 November 2023 at 04:15:41 UTC Chris Siebenmann wrote:

> The Blackbox exporter is a bit tricky to understand in relation to up{},
> because unlike many exporters you create multiple scrape targets against
> (or through) the same exporter. This generally means you want to ignore
> the up{} metric for any particular blackbox probe and instead scrape
> Blackbox's metric endpoint and pay attention to its up{} (for alerts,
> for example).

I think that's worded in a misleading way.

Blackbox exporter does have a /metrics endpoint, but this is only for
metrics internal to the operation of blackbox_exporter itself (e.g. memory
stats, software version). You don't need to scrape this, but it gives you a
little bit of extra info about how your exporter is performing.

Blackbox exporter's main interface is the /probe endpoint, where you tell
it to run individual tests: /probe?target=xxx&module=yyy
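
As a rough sketch of how that endpoint is usually wired into Prometheus, a scrape job rewrites each listed target into the ?target= parameter and sends the actual scrape to the exporter. The probed URL, the http_2xx module name, the job names and the exporter address 127.0.0.1:9115 are placeholder assumptions, not taken from this thread:

```yaml
scrape_configs:
  # Probes: one scrape target per thing to test, all routed through the exporter.
  - job_name: blackbox                    # assumed job name
    metrics_path: /probe
    params:
      module: [http_2xx]                  # assumed module; must be defined in blackbox.yml
    static_configs:
      - targets:
          - https://example.com           # placeholder target
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target      # becomes ?target=...
      - source_labels: [__param_target]
        target_label: instance            # keep the probed target as the instance label
      - source_labels: [__param_module]
        target_label: module              # expose the module as a label for alert templates
      - target_label: __address__
        replacement: 127.0.0.1:9115       # assumed exporter address

  # Optional: the exporter's own internal metrics from /metrics.
  - job_name: blackbox_exporter           # assumed job name
    static_configs:
      - targets:
          - 127.0.0.1:9115
```

With a layout like this, every probe gets its own up and probe_success series labelled with the probed target, while the second job reports only on the exporter process.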

The 'up' metric is generated by Prometheus itself, and only tells you that
it was successfully able to communicate with the exporter and get some
results (without a 4xx / 5xx error for example). So it's correct to say
that you're not interested in the 'up' metric for scrapes to /probe, since
it will always be 1 unless blackbox_exporter itself is badly broken, and
you're interested in probe_success instead.

This is pretty easy to arrange in alerting rules. Here's a starting point:

groups:
- name: UpDown
  rules:
  - alert: UpDown
    expr: up == 0
    for: 3m
    keep_firing_for: 3m
    labels:
      severity: critical
    annotations:
      summary: 'Scrape failed: host is down or scrape endpoint down/unreachable'
- name: BlackboxRules
  rules:
  - alert: ProbeFail
    expr: probe_success == 0
    for: 3m
    keep_firing_for: 3m
    labels:
      severity: critical
    annotations:
      description: |
        {{ $labels.instance }} ({{ $labels.module }}) probe is failing
      summary: Probed service is down

For Grafana I'd probably just make two dashboards, but if you really want a
grand summary of all "problems" then you can simply use a PromQL expression
like this:

up == 0 or probe_success == 0

The "or" operator
<https://prometheus.io/docs/prometheus/latest/querying/operators/#logical-set-binary-operators>
in PromQL is not a boolean: it's more like a set union operator. It will
give you all the values of the "up" vector where the value is 0, along with
all values of the "probe_success" vector where the value is 0 (except for
values of probe_success == 0 which have *exactly* the same labels, ignoring
the metric name, as a value of up == 0, but those are unlikely anyway).

The consumer of this query is going to see a mixture of up{...} and
probe_success{...} metrics, all with value 0.
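
One way to check that behaviour without touching a live server is a promtool unit test. The following is only a sketch: the series, job names and instance values are made up for illustration, and the file name used below is hypothetical.

```yaml
# Hypothetical file: or_test.yml
rule_files: []
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # A node_exporter-style target whose scrape is failing.
      - series: 'up{job="node", instance="a:9100"}'
        values: '0 0 0'
      # A blackbox probe: the scrape of /probe works, but the probe itself fails.
      - series: 'up{job="blackbox", instance="https://example.com"}'
        values: '1 1 1'
      - series: 'probe_success{job="blackbox", instance="https://example.com"}'
        values: '0 0 0'
    promql_expr_test:
      - expr: up == 0 or probe_success == 0
        eval_time: 1m
        exp_samples:
          # From the left-hand side: the failed scrape.
          - labels: 'up{job="node", instance="a:9100"}'
            value: 0
          # From the right-hand side: the failed probe. It is kept because no
          # element of "up == 0" carries the same label set.
          - labels: 'probe_success{job="blackbox", instance="https://example.com"}'
            value: 0
```

Running "promtool test rules or_test.yml" should report success if the expression returns exactly those two samples.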

> there are other multi-target
> indirect exporters like Blackbox. I believe that the SNMP exporter is
> another one where you often have one exporter separately scraping a lot
> of targets, and each target will have its own up{} metric that you
> probably want to ignore.)

The first part of that is correct: SNMP exporter uses
/snmp?target=xxx&module=yyy&auth=zzz.

But the second part is wrong: if SNMP exporter fails to talk to the target
then it returns an empty scrape with a 4xx/5xx error code, which Prometheus
turns into up == 0. So you definitely *do* want to alert on up == 0 in this
case, as that's how you detect a device which is failing to respond to SNMP.
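
For comparison, a scrape job for SNMP exporter usually follows the same relabelling pattern. This is a sketch only: the device address, the if_mib module, the public_v2 auth name and the exporter address 127.0.0.1:9116 are assumptions that depend on your snmp.yml and exporter version.

```yaml
scrape_configs:
  - job_name: snmp                        # assumed job name
    metrics_path: /snmp
    params:
      module: [if_mib]                    # assumed module from snmp.yml
      auth: [public_v2]                   # assumed auth entry from snmp.yml
    static_configs:
      - targets:
          - 192.0.2.10                    # placeholder device address
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target      # becomes ?target=...
      - source_labels: [__param_target]
        target_label: instance            # keep the device as the instance label
      - target_label: __address__
        replacement: 127.0.0.1:9116       # assumed exporter address
```

Here up{job="snmp"} == 0 for a given instance means that device did not answer (or the exporter itself is unreachable), so the generic up == 0 alert covers it.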

> In our environment, it's useful for us to have a granular view of what
> has failed. That a device has stopped pinging is a different issue than
> its node_exporter not being up, so our dashboards (and alerts) reflect
> that.

I agree with that. Different metrics inherently have different meanings,
and although 'up' and 'probe_success' have similar 0/1 semantics, there's
other information you can get from blackbox_exporter when probe_success == 0
which can tell you more about the nature of the problem (e.g. failure to
connect, failure to resolve DNS name, TLS certificate validation failure,
etc.).
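
As an illustration of using that extra information, here is a hedged sketch of two rules built on metrics that blackbox_exporter exposes for HTTP probes. The group name, alert names, severities and thresholds are arbitrary choices, not recommendations from this thread:

```yaml
groups:
  - name: BlackboxDiagnostics             # hypothetical group name
    rules:
      # The probe failed even though an HTTP response came back, with an error status.
      - alert: ProbeHttpErrorStatus
        expr: probe_success == 0 and probe_http_status_code >= 400
        for: 3m
        labels:
          severity: warning
      # The served certificate is close to expiry; 14 days is an arbitrary threshold.
      - alert: ProbeTlsCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: warning
```

For a one-off investigation, adding &debug=true to the /probe URL should also return the exporter's log output for that single probe, which typically spells out whether DNS resolution, the connection or TLS validation was the step that failed.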
