On Thursday, 21 April 2022 at 09:22:32 UTC+1 [email protected] wrote:
> *blackbox exporter config:*
> icmp:
>   prober: icmp
>   icmp:
>     preferred_ip_protocol: "ip4"
> tcp:
>   prober: tcp
>   timeout: 5s
>   tcp:
>     preferred_ip_protocol: "ip4"
>
> *Prometheus scrape config:*
>
...
> - job_name: SSH
>   metrics_path: /probe
>   params:
>     module: [ssh_banner]
>   file_sd_configs:
>     - files:
>         - '/etc/prometheus/targets/'
>   relabel_configs:
>     - source_labels: [__address__]
>       target_label: __param_target
>       regex: '([^:]+)(:[0-9]+)?'
>       replacement: '${1}:22'
>     - source_labels: [__param_target]
>       target_label: instance
>     - target_label: __address__
>       replacement: prometheus-blackbox-exporter:9115
>
In your scrape job you are setting the parameter module=ssh_banner, but you
have not defined a module called "ssh_banner" in your blackbox exporter
config, so every probe with that module will fail. Test like this:

curl -g 'http://prometheus-blackbox-exporter:9115/probe?module=ssh_banner&target=blah.example.com&debug=true'
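For reference, a module along these lines would need to exist in your blackbox
exporter config. This is only a sketch based on the ssh_banner example shipped
in blackbox_exporter's example.yml; adjust the timeout and expected banner to
your environment:

    modules:
      ssh_banner:
        prober: tcp
        timeout: 5s
        tcp:
          preferred_ip_protocol: "ip4"
          query_response:
            - expect: "^SSH-2.0-"

The query_response check makes the probe fail unless the target actually
returns an SSH banner, rather than just accepting the TCP connection.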
> *Alert rules:*
> - alert: TargetDown
>   expr: probe_success == 0
>   for: 5s
>   labels:
>     severity: critical
>   annotations:
>     description: Service {{ $labels.instance }} is unreachable.
>     value: DOWN ({{ $value }})
>     summary: "Target {{ $labels.instance }} is down."
>
>
You can leave out "for: 5s", since you're only scraping and evaluating rules
every 60s.
If you don't want an immediate alert on a single probe failure (such as one
dropped packet), set "for: 1m" or "for: 2m" as required. The alert will then
fire only if the condition is continuously present for that duration.
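For example, a sketch of the same rule with a 2-minute hold (annotations
omitted for brevity):

    - alert: TargetDown
      expr: probe_success == 0
      for: 2m
      labels:
        severity: critical

With a 60s evaluation interval this means at least two consecutive failed
probes before the alert fires.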
> *Alert manager config:*
> ...
> - name: email-me
>   email_configs:
>     - to: alert
>       send_resolved: true
>
>
In your original post you said "but black box exporter detect the recover
behavior after about 5mins". Are you talking about when you receive the
"send_resolved" message from Alertmanager?
There are various delays which can occur between Prometheus raising an alert
and Alertmanager sending it, and likewise between Prometheus withdrawing an
alert and Alertmanager sending a resolved message.
If I understand correctly: Prometheus doesn't explicitly "resolve" an
alert, rather it just stops sending that alert. The alert comes with an
"endsAt" time, which is explained here:
https://github.com/prometheus/prometheus/issues/5277
"3x the greater of the evaluation_interval or resend-delay values"
(https://github.com/prometheus/prometheus/blob/f678e27eb62ecf56e2b0bad82345925a4d6162a2/rules/alerting.go#L450)
Since you have an evaluation_interval of 60s, I believe this means there
will be at least a 3 minute delay between an alert ceasing to fire, and the
resolved message being sent.
See also:
https://pracucci.com/prometheus-understanding-the-delays-on-alerting.html
https://prometheus.io/docs/alerting/latest/clients/
https://prometheus.io/docs/alerting/latest/configuration/#configuration-file
# ResolveTimeout is the default value used by alertmanager if the alert does
# not include EndsAt, after this time passes it can declare the alert as
# resolved if it has not been updated.
# This has no impact on alerts from Prometheus, as they always include EndsAt.
[ resolve_timeout: <duration> | default = 5m ]
Really I think you need to separate your problem into two parts:
1. Make sure that blackbox_exporter is probing ICMP and SSH successfully.
   Check that "probe_success" goes to 0 or 1 at the correct times. View the
   PromQL history of the probe_success metric to confirm this. Ignore alerts.
2. Then look at your alerting configuration, as to exactly when it sends
   messages.
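For step 1, you could graph queries like these in the Prometheus expression
browser (the job label here is assumed to match your scrape config):

    probe_success{job="SSH"}
    min_over_time(probe_success{job="SSH"}[5m])

The first shows the raw probe history; the second shows whether any probe
failed in the last 5 minutes, which is handy for spotting brief blips.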
--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-users/cd3aa371-e968-4b44-98a5-326c3da1a487n%40googlegroups.com.