I am trying to understand why this problem occurs (i.e. why Flink does
not reject the job when it requires more slots than are available).

You mentioned running Flink in production on EMR (YARN): does this mean
Flink was being run in Job mode or Session mode?

Thank you,

Piper

On Thu, Nov 21, 2019 at 4:56 PM Piper Piper <piperfl...@gmail.com> wrote:

> Thank you, Kelly!
>
> On Thu, Nov 21, 2019 at 4:06 PM Kelly Smith <kell...@zillowgroup.com>
> wrote:
>
>> Hi Piper,
>>
>>
>>
>> The repro is pretty simple:
>>
>>    - Submit a job with parallelism set higher than YARN has resources to
>>    support
>>
>>
>>
>> What this ends up looking like in the Flink UI is this:
>>
>>
>>
>> The Job is in a “RUNNING” state, but all of the tasks are in the
>> “SCHEDULED” state. The `jobmanager.numRunningJobs` metric that Flink emits
>> by default will increase by 1, but none of the tasks actually get scheduled
>> on any TM.
>>
>>
>>
>>
>>
>> What I’m looking for is a way to detect when I am in this state using
>> Flink metrics (ideally the count of tasks in each state for better
>> observability).
>>
>>
>>
>> Does that make sense?
>>
>>
>>
>> Thanks,
>>
>> Kelly
>>
>>
>>
>> *From: *Piper Piper <piperfl...@gmail.com>
>> *Date: *Thursday, November 21, 2019 at 12:59 PM
>> *To: *Kelly Smith <kell...@zillowgroup.com>
>> *Cc: *"user@flink.apache.org" <user@flink.apache.org>
>> *Subject: *Re: Metrics for Task States
>>
>>
>>
>> Hello Kelly,
>>
>>
>>
>> I thought that the Flink scheduler only starts a job if all requested
>> containers/TMs are available and allotted to that job.
>>
>>
>>
>> How can I reproduce your issue on Flink with YARN?
>>
>>
>>
>> Thank you,
>>
>>
>>
>> Piper
>>
>>
>>
>>
>>
>> On Thu, Nov 21, 2019, 1:48 PM Kelly Smith <kell...@zillowgroup.com>
>> wrote:
>>
>> I’ve been running Flink in production on EMR (YARN) for some time and
>> have found the metrics system to be quite useful, but there is one specific
>> scenario where I’m missing a signal:
>>
>>
>>
>>    - When a job has been submitted, but YARN does not have enough
>>    resources to support it
>>
>>
>>
>> Observed:
>>
>>    - Job is in RUNNING state
>>    - All of the tasks for the job are in the (I believe) DEPLOYING state
>>
>>
>>
>> Is there a way to access these as metrics for monitoring the number of
>> tasks in each state for a given job (image below)? The metric I’m currently
>> using is the number of running jobs, but it misses this “unhealthy”
>> scenario. I realize that I could use application-level metrics (record
>> counts, etc) as a proxy for this, but I’m working on providing a streaming
>> platform and need all of my monitoring to be application agnostic.
>>
>> [image: number of tasks in each state for a job, as shown in the Flink UI]
>>
>>
>>
>> I can’t find anything on it in the documentation.
>>
>>
>>
>> Thanks,
>>
>> Kelly
>>
>>
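
The state described in this thread (job reported as RUNNING while every
task is still SCHEDULED) can be checked from per-state task counts. Below
is a minimal sketch, not a definitive implementation. It assumes the counts
have already been fetched from somewhere, for example from the `tasks`
object that the Flink REST API reports per job; that source and the exact
key names are assumptions here, and `is_stuck` is a hypothetical helper
name, not part of Flink.

```python
from typing import Mapping


def is_stuck(job_state: str, task_counts: Mapping[str, int]) -> bool:
    """Return True when a job reports RUNNING but no task is running yet.

    task_counts maps a task state name to the number of tasks in that
    state. Key case is normalized, since the source of the counts is an
    assumption and may use either "scheduled" or "SCHEDULED".
    """
    counts = {state.lower(): n for state, n in task_counts.items()}
    # Tasks that have not yet been assigned a slot on any TaskManager.
    pending = counts.get("created", 0) + counts.get("scheduled", 0)
    running = counts.get("running", 0)
    return job_state == "RUNNING" and running == 0 and pending > 0


if __name__ == "__main__":
    # The scenario from the thread: job RUNNING, all tasks SCHEDULED.
    print(is_stuck("RUNNING", {"scheduled": 8, "running": 0}))
```

A monitor would probably want to alert only if this condition holds for
longer than a normal startup takes, since every job passes through it
briefly while slots are being requested.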
