Hi Jesús,
If your job has checkpointing enabled, you can monitor
'numberOfCompletedCheckpoints' to see whether the job is still alive and
healthy.
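
As a rough sketch of how that could look as Prometheus alerting rules (not a
tested setup): the metric name flink_jobmanager_job_numberOfCompletedCheckpoints
and the job_name label assume the default Prometheus reporter naming, and the
group name, alert names, time windows and severity label are placeholders to
adapt. The first rule fires when a job's checkpoint count has not changed for a
while; the second covers the case mentioned in this thread where the series
simply disappears once the job is gone.

```yaml
groups:
  - name: flink-job-health            # hypothetical group name
    rules:
      # The per-job checkpoint count has not changed in the last 10 minutes.
      # Assumes the checkpoint interval is well below 10 minutes.
      - alert: FlinkJobCheckpointsStalled
        expr: changes(flink_jobmanager_job_numberOfCompletedCheckpoints[10m]) == 0
        for: 5m
        labels:
          severity: critical          # placeholder label
        annotations:
          summary: "Flink job {{ $labels.job_name }} completed no checkpoints in 10 minutes"

      # The series existed 15 minutes ago but is absent now, e.g. because the
      # job failed and its metrics are no longer reported.
      - alert: FlinkJobMetricsDisappeared
        expr: |
          flink_jobmanager_job_numberOfCompletedCheckpoints offset 15m
            unless flink_jobmanager_job_numberOfCompletedCheckpoints
        labels:
          severity: critical          # placeholder label
        annotations:
          summary: "Flink job {{ $labels.job_name }} stopped reporting checkpoint metrics"
```

Both rules are evaluated per series, so they alert on individual jobs without
hard-coding the expected number of jobs. Note that the second rule only stays
active for roughly the offset window after the series disappears, so the
notification has to be acted on within that time.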

Thanks,
Zhu Zhu

Jesús Vásquez <jesusvasquezr1...@gmail.com> wrote on Tue, Dec 17, 2019 at 2:43 AM:

> The thing about the numRunningJobs metric is that I have to configure the
> Prometheus rules in advance with the number of jobs I expect to be running
> in order to alert, and I need this rule to alert on individual jobs. I
> initially thought of flink_jobmanager_job_downtime{job_id=~".*"} == -1, but it
> turned out that the metric just emits 0 for running jobs and doesn't emit -1
> for failed jobs.
>
> On Mon, Dec 16, 2019, 7:01 PM, PoolakkalMukkath, Shakir <
> shakir_poolakkalmukk...@comcast.com> wrote:
>
>> You could use “flink_jobmanager_numRunningJobs” to check the number of
>> running jobs.
>>
>> Thanks
>>
>> *From: *Jesús Vásquez <jesusvasquezr1...@gmail.com>
>> *Date: *Monday, December 16, 2019 at 12:47 PM
>> *To: *"user@flink.apache.org" <user@flink.apache.org>
>> *Subject: *[EXTERNAL] Flink and Prometheus monitoring question
>>
>>
>>
>> Hi,
>>
>> I want to monitor Flink streaming jobs using Prometheus.
>>
>> My first goal is to send alerts when a Flink job has failed.
>>
>> The thing is that, looking at the documentation, I haven't found a metric
>> that helps me define an alerting rule.
>>
>> As a starting point I thought that the metric
>> flink_jobmanager_job_downtime could help, since the doc says this metric
>> emits -1 for a completed job.
>>
>> But when I tested this I found that it doesn't work: the metric
>> always emits 0, and after the job is completed there is no metric at all.
>>
>> Has anyone managed to alert with Prometheus when a Flink job has failed?
>>
>> Thanks for your help.
>>
>
