Hi Jesús,

If your job has checkpointing enabled, you can monitor 'numberOfCompletedCheckpoints' to see whether the job is still alive and healthy.
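
A rough sketch of what such a per-job rule could look like, not tested against any particular setup. It assumes the Prometheus reporter exposes the metric as flink_jobmanager_job_numberOfCompletedCheckpoints with a job_name label; the group name, alert names, severity label and the 10m/5m windows are placeholders, and the window should be several times the checkpoint interval.

```yaml
groups:
  - name: flink-job-health            # hypothetical group name
    rules:
      # The job still reports metrics but has completed no checkpoint
      # in the last 10 minutes (assumed window; tune to your interval).
      - alert: FlinkJobNoCompletedCheckpoints
        expr: increase(flink_jobmanager_job_numberOfCompletedCheckpoints[10m]) == 0
        for: 5m
        labels:
          severity: critical          # assumed label scheme
        annotations:
          summary: "Flink job {{ $labels.job_name }} has completed no checkpoints in the last 10 minutes"

      # The series existed 10 minutes ago but is missing now, i.e. the job
      # stopped reporting. Matches per job without listing jobs in advance.
      - alert: FlinkJobMetricsDisappeared
        expr: >
          flink_jobmanager_job_numberOfCompletedCheckpoints offset 10m
          unless flink_jobmanager_job_numberOfCompletedCheckpoints
        labels:
          severity: critical
        annotations:
          summary: "Flink job {{ $labels.job_name }} stopped reporting checkpoint metrics"
```

Both expressions evaluate per series, so they do not need the expected number of jobs to be configured up front. The second rule only fires for roughly the length of the offset window after a job disappears, so the notification has to be acted on (or the offset widened) within that time.
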

Thanks,
Zhu Zhu

Jesús Vásquez <jesusvasquezr1...@gmail.com> wrote on Tue, Dec 17, 2019 at 2:43 AM:

> The thing about the numRunningJobs metric is that I have to configure the
> Prometheus rules in advance with the number of jobs I expect to be running
> in order to alert; I need this rule to alert on individual jobs. I
> initially thought of flink_jobmanager_downtime{job_id=~".*"} == -1, but it
> turned out that the metric just emits 0 for running jobs and doesn't emit -1
> for failed jobs.
>
> On Mon, Dec 16, 2019 at 7:01 PM, PoolakkalMukkath, Shakir <
> shakir_poolakkalmukk...@comcast.com> wrote:
>
>> You could use “flink_jobmanager_numRunningJobs” to check the number of
>> running jobs.
>>
>> Thanks
>>
>> *From: *Jesús Vásquez <jesusvasquezr1...@gmail.com>
>> *Date: *Monday, December 16, 2019 at 12:47 PM
>> *To: *"user@flink.apache.org" <user@flink.apache.org>
>> *Subject: *[EXTERNAL] Flink and Prometheus monitoring question
>>
>> Hi,
>>
>> I want to monitor Flink Streaming jobs using Prometheus.
>>
>> My first goal is to send alerts when a Flink job has failed.
>>
>> The thing is that, looking at the documentation, I haven't found a metric
>> that helps me define an alerting rule.
>>
>> As a starting point I thought that the metric
>> flink_jobmanager_job_downtime could help, since the doc says this metric
>> emits -1 for a completed job.
>>
>> But when I tested this I found out that it doesn't work, since the metric
>> always emits 0 and after the job is completed there is no metric.
>>
>> Has anyone managed to alert with Prometheus when a Flink job has failed?
>>
>> Thanks for your help.
>>