The thing about the numRunningJobs metric is that I have to configure the Prometheus rules in advance with the number of jobs I expect to be running in order to alert. I need this rule to alert on individual jobs. I initially thought of flink_jobmanager_job_downtime{job_id=~".*"} == -1, but it turned out that the metric just emits 0 for running jobs and doesn't emit -1 for failed jobs.
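
To illustrate the limitation with numRunningJobs, a rule built on it would look roughly like the sketch below. The group name, alert name, the `for` duration, the severity label, and the expected count of 3 are all made up for the example; only the metric name comes from this thread.

```yaml
groups:
  - name: flink-jobs                          # hypothetical group name
    rules:
      - alert: FlinkRunningJobsBelowExpected  # hypothetical alert name
        # 3 is an assumed expected job count; it has to be hardcoded here
        expr: flink_jobmanager_numRunningJobs < 3
        for: 1m                               # assumed grace period
        labels:
          severity: critical                  # assumed label
        annotations:
          summary: "Fewer Flink jobs running than expected"
```

A rule like this only says that some job is missing, not which one, and the threshold has to be edited every time a job is added or removed.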

On Mon, Dec 16, 2019 at 7:01 PM, PoolakkalMukkath, Shakir <shakir_poolakkalmukk...@comcast.com> wrote:

> You could use "flink_jobmanager_numRunningJobs" to check the number of
> running jobs.
>
> Thanks
>
> From: Jesús Vásquez <jesusvasquezr1...@gmail.com>
> Date: Monday, December 16, 2019 at 12:47 PM
> To: "user@flink.apache.org" <user@flink.apache.org>
> Subject: [EXTERNAL] Flink and Prometheus monitoring question
>
> Hi,
>
> I want to monitor Flink Streaming jobs using Prometheus.
>
> My first goal is to send alerts when a Flink job has failed.
>
> The thing is that, looking at the documentation, I haven't found a metric
> that helps me define an alerting rule.
>
> As a starting point I thought that the metric
> flink_jobmanager_job_downtime could help, since the doc says this metric
> emits -1 for a completed job.
>
> But when I tested this I found out it doesn't work, since the metric
> always emits 0 and after the job is completed there is no metric.
>
> Has anyone managed to alert when a Flink job has failed with Prometheus?
>
> Thanks for your help.