The problem with the numRunningJobs metric is that I have to configure the
Prometheus rules in advance with the number of jobs I expect to be running
in order to alert. I need the rule to alert on individual jobs. I
initially thought of flink_jobmanager_job_downtime{job_id=~".*"} == -1, but
it turned out that the metric only emits 0 for running jobs and doesn't
emit -1 for failed jobs.
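
One per-job approach that does not need a hard-coded job count is to alert
when a job's series disappears, since the metric stops being emitted once
the job is no longer running. The rule below is only a minimal sketch under
assumptions: the group name, alert name, severity label and the 10m window
are made up for illustration, and it assumes the series carries a job_id
label as in the selector above.

```yaml
groups:
  - name: flink-job-alerts            # hypothetical group name
    rules:
      - alert: FlinkJobDisappeared    # hypothetical alert name
        # Matches series that had a sample 10 minutes ago but have none now.
        # "unless" matches on all labels, so each job_id is checked
        # individually.
        expr: |
          flink_jobmanager_job_downtime offset 10m
            unless flink_jobmanager_job_downtime
        labels:
          severity: critical          # assumed label, adjust to your routing
        annotations:
          summary: "Flink job {{ $labels.job_id }} stopped reporting metrics"
```

Two caveats with this sketch: it fires only for roughly the length of the
offset window after the series vanishes, and it cannot tell a failed job
from one that finished or was cancelled on purpose.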

On Mon, Dec 16, 2019 at 7:01 PM, PoolakkalMukkath, Shakir <
shakir_poolakkalmukk...@comcast.com> wrote:

> You could use “flink_jobmanager_numRunningJobs” to check the number of
> running jobs.
>
>
>
> Thanks
>
>
>
> *From: *Jesús Vásquez <jesusvasquezr1...@gmail.com>
> *Date: *Monday, December 16, 2019 at 12:47 PM
> *To: *"user@flink.apache.org" <user@flink.apache.org>
> *Subject: *[EXTERNAL] Flink and Prometheus monitoring question
>
>
>
> Hi,
>
> I want to monitor Flink streaming jobs using Prometheus.
>
> My first goal is to send alerts when a Flink job has failed.
>
> The thing is that, looking at the documentation, I haven't found a metric
> that helps me define an alerting rule.
>
> As a starting point I thought that the metric
> flink_jobmanager_job_downtime could help, since the docs say this metric
> emits -1 for a completed job.
>
> But when I tested this I found that it doesn't work: the metric
> always emits 0, and after the job is completed there is no metric.
>
> Has anyone managed to alert with Prometheus when a Flink job has failed?
>
> Thanks for your help.
>
