I am trying to understand why this problem occurs (i.e. why Flink did not reject the job when it required more slots than were available).
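One way to check for the state described in this thread (job reported as RUNNING, no task actually running) outside the metrics system is to poll the JobManager's REST API, which is, as far as I know, what the web UI uses for its per-job task counts. The sketch below only works on an already-parsed JSON payload. The field names (`jobs`, `jid`, `state`, `tasks`, and the lowercase state keys) are assumed from the `/jobs/overview` endpoint and should be checked against the Flink version in use. The helper names `task_state_counts` and `stuck_jobs` are made up for illustration, and the sample payload is hand-written, not real output.

```python
from typing import Dict, List

# Task states that mean "not running yet". These key names are an assumption
# about the /jobs/overview payload and should be verified for your Flink version.
NOT_YET_RUNNING = ("created", "scheduled", "deploying")


def task_state_counts(job: dict) -> Dict[str, int]:
    """Return per-state task counts for one job entry, dropping the 'total' key."""
    tasks = job.get("tasks", {})
    return {
        state.lower(): int(count)
        for state, count in tasks.items()
        if state.lower() != "total"
    }


def stuck_jobs(overview: dict) -> List[str]:
    """Return IDs of jobs that are RUNNING but have no running task and at
    least one task still waiting to be created, scheduled, or deployed."""
    stuck = []
    for job in overview.get("jobs", []):
        counts = task_state_counts(job)
        waiting = sum(counts.get(state, 0) for state in NOT_YET_RUNNING)
        if (
            job.get("state") == "RUNNING"
            and counts.get("running", 0) == 0
            and waiting > 0
        ):
            stuck.append(job["jid"])
    return stuck


# Hand-written sample shaped like the assumed payload (not real Flink output).
sample = {
    "jobs": [
        {"jid": "job-a", "state": "RUNNING",
         "tasks": {"total": 4, "scheduled": 4, "running": 0}},
        {"jid": "job-b", "state": "RUNNING",
         "tasks": {"total": 2, "scheduled": 0, "running": 2}},
    ]
}

print(stuck_jobs(sample))
```

In practice the payload would be fetched from the JobManager's REST address and passed to `stuck_jobs`. The per-state counts from `task_state_counts` could then be exported to whatever monitoring system is in use, which keeps the check application agnostic.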

Flink in production on EMR (YARN): Does this mean Flink was being run in Job mode or Session mode?

Thank you,
Piper

On Thu, Nov 21, 2019 at 4:56 PM Piper Piper <piperfl...@gmail.com> wrote:

> Thank you, Kelly!
>
> On Thu, Nov 21, 2019 at 4:06 PM Kelly Smith <kell...@zillowgroup.com>
> wrote:
>
>> Hi Piper,
>>
>> The repro is pretty simple:
>>
>> - Submit a job with parallelism set higher than YARN has resources to
>> support
>>
>> What this ends up looking like in the Flink UI is this:
>>
>> The Job is in a “RUNNING” state, but all of the tasks are in the
>> “SCHEDULED” state. The `jobmanager.numRunningJobs` metric that Flink emits
>> by default will increase by 1, but none of the tasks actually get scheduled
>> on any TM.
>>
>> What I’m looking for is a way to detect when I am in this state using
>> Flink metrics (ideally the count of tasks in each state for better
>> observability).
>>
>> Does that make sense?
>>
>> Thanks,
>>
>> Kelly
>>
>> *From: *Piper Piper <piperfl...@gmail.com>
>> *Date: *Thursday, November 21, 2019 at 12:59 PM
>> *To: *Kelly Smith <kell...@zillowgroup.com>
>> *Cc: *"user@flink.apache.org" <user@flink.apache.org>
>> *Subject: *Re: Metrics for Task States
>>
>> Hello Kelly,
>>
>> I thought that Flink scheduler only starts a job if all requested
>> containers/TMs are available and allotted to that job.
>>
>> How can I reproduce your issue on Flink with YARN?
>>
>> Thank you,
>>
>> Piper
>>
>> On Thu, Nov 21, 2019, 1:48 PM Kelly Smith <kell...@zillowgroup.com>
>> wrote:
>>
>> I’ve been running Flink in production on EMR (YARN) for some time and
>> have found the metrics system to be quite useful, but there is one specific
>> case where I’m missing a signal for this scenario:
>>
>> - When a job has been submitted, but YARN does not have enough
>> resources to provide
>>
>> Observed:
>>
>> - Job is in RUNNING state
>> - All of the tasks for the job are in the (I believe) DEPLOYING state
>>
>> Is there a way to access these as metrics for monitoring the number of
>> tasks in each state for a given job (image below)? The metric I’m currently
>> using is the number of running jobs, but it misses this “unhealthy”
>> scenario. I realize that I could use application-level metrics (record
>> counts, etc) as a proxy for this, but I’m working on providing a streaming
>> platform and need all of my monitoring to be application agnostic.
>>
>> [image: cid:image001.png@01D5A059.19DB3EB0]
>>
>> I can’t find anything on it in the documentation.
>>
>> Thanks,
>>
>> Kelly