With EMR/YARN, the cluster is definitely running in session mode. It exists 
independently of any job and continues running after the job exits.

Whether or not this is a bug in Flink, is it possible to get access to the 
metrics I'm asking about? Those would be useful even if this behavior is fixed.

Get Outlook for Android<https://aka.ms/ghei36>

________________________________
From: Piper Piper <piperfl...@gmail.com>
Sent: Friday, November 22, 2019 9:10:41 PM
To: Kelly Smith <kell...@zillowgroup.com>
Cc: user@flink.apache.org <user@flink.apache.org>
Subject: Re: Metrics for Task States

I am trying to reason why this problem should occur (i.e. why Flink could not 
reject the job when it required more slots than were available).

Flink in production on EMR (YARN): Does this mean Flink was being run in Job 
mode or Session mode?

Thank you,

Piper

On Thu, Nov 21, 2019 at 4:56 PM Piper Piper 
<piperfl...@gmail.com<mailto:piperfl...@gmail.com>> wrote:
Thank you, Kelly!

On Thu, Nov 21, 2019 at 4:06 PM Kelly Smith 
<kell...@zillowgroup.com<mailto:kell...@zillowgroup.com>> wrote:

Hi Piper,



The repro is pretty simple:

  *   Submit a job with parallelism set higher than YARN has resources to 
support



What this ends up looking like in the Flink UI is this:
[cid:16e8fd359da4cff311]



The Job is in a “RUNNING” state, but all of the tasks are in the “SCHEDULED” 
state. The `jobmanager.numRunningJobs` metric that Flink emits by default will 
increase by 1, but none of the tasks actually get scheduled on any TM.



[cid:16e8fd359db5b16b22]



What I’m looking for is a way to detect when I am in this state using Flink 
metrics (ideally the count of tasks in each state for better observability).



Does that make sense?



Thanks,

Kelly



From: Piper Piper <piperfl...@gmail.com<mailto:piperfl...@gmail.com>>
Date: Thursday, November 21, 2019 at 12:59 PM
To: Kelly Smith <kell...@zillowgroup.com<mailto:kell...@zillowgroup.com>>
Cc: "user@flink.apache.org<mailto:user@flink.apache.org>" 
<user@flink.apache.org<mailto:user@flink.apache.org>>
Subject: Re: Metrics for Task States



Hello Kelly,



I thought that Flink scheduler only starts a job if all requested 
containers/TMs are available and allotted to that job.



How can I reproduce your issue on Flink with YARN?



Thank you,



Piper





On Thu, Nov 21, 2019, 1:48 PM Kelly Smith 
<kell...@zillowgroup.com<mailto:kell...@zillowgroup.com>> wrote:

I’ve been running Flink in production on EMR (YARN) for some time and have 
found the metrics system to be quite useful, but there is one specific case 
where I’m missing a signal for this scenario:



  *   When a job has been submitted, but YARN does not have enough resources to 
provide



Observed:

  *   Job is in RUNNING state
  *   All of the tasks for the job are in the (I believe) DEPLOYING state



Is there a way to access these as metrics for monitoring the number of tasks in 
each state for a given job (image below)? The metric I’m currently using is the 
number of running jobs, but it misses this “unhealthy” scenario. I realize that 
I could use application-level metrics (record counts, etc) as a proxy for this, 
but I’m working on providing a streaming platform and need all of my monitoring 
to be application agnostic.

[cid:image001.png@01D5A059.19DB3EB0]



I can’t find anything on it in the documentation.



Thanks,

Kelly

Reply via email to