I’ve been running Flink in production on EMR (YARN) for some time and have found the metrics system quite useful, but there is one scenario for which I’m missing a signal:
* When a job has been submitted, but YARN does not have enough resources to provide

Observed:
* The job is in the RUNNING state
* All of the tasks for the job are in (I believe) the DEPLOYING state

Is there a way to access these as metrics, so that I can monitor the number of tasks in each state for a given job (image below)? The metric I’m currently using is the number of running jobs, but it misses this “unhealthy” scenario. I realize that I could use application-level metrics (record counts, etc.) as a proxy for this, but I’m working on providing a streaming platform and need all of my monitoring to be application-agnostic.

[cid:image001.png@01D5A059.19DB3EB0]

I can’t find anything on it in the documentation.

Thanks,
Kelly