Hi there, I just watched the Flink Forward talk from Amazon on measuring uptime [1] (slides at [2]), which references a thread on the developer mailing list [3].
It seems that Amazon is already running with those metrics enabled in their production cluster. I'd really like to have those statuses available in our Flink deployment as well. The author at AWS links three different design docs, and the mailing list discussion concluded that the best way to implement those metrics would be some kind of job status listener in the JobManager (call it jobstatus, incident or whatever).

However, I was not able to find any JIRA issue for this story, and I also could not find out whether it is now implemented, planned or rejected. Does anyone know more about it, and whether there is such a listener somewhere in the JobManager?

Best regards
Theo

[1] https://www.youtube.com/watch?v=pIVmw1HyUqU
[2] https://de.slideshare.net/FlinkForward/virtual-flink-forward-2020-lessons-learned-on-apache-flink-application-availability-in-a-hosted-apache-flink-service-praveen-gattu-hwanju-kim-ryan-nienhuis
[3] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Proposal-for-Flink-job-execution-availability-metrics-impovement-td28882.html#a28962