Hi Piper,

The repro is pretty simple:

* Submit a job with parallelism set higher than YARN has resources to support

What this ends up looking like in the Flink UI is this:

[inline image]

The Job is in a “RUNNING” state, but all of the tasks are in the “SCHEDULED” state. The `jobmanager.numRunningJobs` metric that Flink emits by default will increase by 1, but none of the tasks actually get scheduled on any TM.

[inline image]

What I’m looking for is a way to detect when I am in this state using Flink metrics (ideally the count of tasks in each state for better observability).

Does that make sense?

Thanks,
Kelly

From: Piper Piper <[email protected]>
Date: Thursday, November 21, 2019 at 12:59 PM
To: Kelly Smith <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: Metrics for Task States

Hello Kelly,

I thought that Flink scheduler only starts a job if all requested containers/TMs are available and allotted to that job. How can I reproduce your issue on Flink with YARN?

Thank you,
Piper

On Thu, Nov 21, 2019, 1:48 PM Kelly Smith <[email protected]> wrote:

I’ve been running Flink in production on EMR (YARN) for some time and have found the metrics system to be quite useful, but there is one specific case where I’m missing a signal for this scenario:

* When a job has been submitted, but YARN does not have enough resources to provide

Observed:

* Job is in RUNNING state
* All of the tasks for the job are in the (I believe) DEPLOYING state

Is there a way to access these as metrics for monitoring the number of tasks in each state for a given job (image below)? The metric I’m currently using is the number of running jobs, but it misses this “unhealthy” scenario.

I realize that I could use application-level metrics (record counts, etc) as a proxy for this, but I’m working on providing a streaming platform and need all of my monitoring to be application agnostic.

[inline image]

I can’t find anything on it in the documentation.

Thanks,
Kelly
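
For the state described in this thread (job RUNNING, no task actually running), one possible application-agnostic stopgap is to poll the JobManager's REST API rather than the metrics system. The sketch below is not an answer from the Flink maintainers. It assumes the `/jobs/overview` endpoint returns a `jobs` list in which each entry carries `jid`, `state`, and a `tasks` map of lowercase per-state counts (`total`, `scheduled`, `running`, ...); please verify that shape against your Flink version. The function name `find_stuck_jobs`, the helper `fetch_overview`, and the sample payload are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict, List


def fetch_overview(base_url: str) -> Dict[str, Any]:
    """Fetch the job overview from a JobManager REST endpoint.

    base_url is a placeholder such as "http://<jobmanager-host>:8081";
    the endpoint path and response shape are assumptions to verify.
    """
    with urllib.request.urlopen(f"{base_url}/jobs/overview") as resp:
        return json.load(resp)


def find_stuck_jobs(overview: Dict[str, Any]) -> List[str]:
    """Return ids of jobs reported as RUNNING that have no running task."""
    stuck = []
    for job in overview.get("jobs", []):
        tasks = job.get("tasks", {})
        total = tasks.get("total", 0)
        running = tasks.get("running", 0)
        # A job with tasks but none running matches the "unhealthy" case.
        if job.get("state") == "RUNNING" and total > 0 and running == 0:
            stuck.append(job.get("jid", "<unknown>"))
    return stuck


# Hypothetical payload mimicking the assumed response shape.
sample = {
    "jobs": [
        {"jid": "job-a", "state": "RUNNING",
         "tasks": {"total": 4, "scheduled": 4, "running": 0}},
        {"jid": "job-b", "state": "RUNNING",
         "tasks": {"total": 4, "scheduled": 0, "running": 4}},
    ]
}
stuck_ids = find_stuck_jobs(sample)
```

A monitoring agent could call `find_stuck_jobs(fetch_overview(...))` on a schedule and alert when the list stays non-empty for longer than normal startup takes. A short grace period would be needed, since tasks pass through SCHEDULED and DEPLOYING during a healthy start as well.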
