Hi,
I want to monitor Flink Streaming jobs using Prometheus.
My first goal is to send alerts when a Flink job has failed.
The thing is that, looking at the documentation, I haven't found a metric
that would help me define an alerting rule.
As a starting point, I thought that the metric flink_jobmanager_job_downtime
could help, since the docs say this metric emits -1 for a completed job.
But when I tested this, I found that it doesn't work: the metric
always emits 0, and once the job has completed the metric is no longer reported at all.
Has anyone managed to alert on a failed Flink job with Prometheus?
Thanks for your help.