High Availability on Yarn

Jain, Ankit Mon, 01 May 2017 10:00:29 -0700

Hi fellow users,
We are trying to straighten out high availability story for flink.

Our setup includes a long running EMR cluster, job submission is a two-step
process – 1) Flink cluster is first created using flink yarn client on the EMR
cluster already running 2) Flink job is submitted.

I also saw references that with 1.2, these two steps have been combined into 1
– is that change in FlinkYarnSessionCli.java? Can somebody point to
documentation please?

W/o worrying about Yarn RM (not Flink Yarn RM that seems to be newly
introduced) failure for now, I want to understand first how task manager & job
manager failures are handled.

My questions-

1)
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077
suggests a new RM has been added and now there is one JobManager for each job.
Since Yarn RM will now talk to Flink RM( instead of JobManager previously),
will Yarn automatically restart failing Flink RM?

2) Is there any documentation on behavior of new Flink RM that will come
up? How will previously running JobManagers & TaskManagers find out about new
RM?

3)
https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/jobmanager_high_availability.html#configuration
requires configuring Zookeeper even for Yarn – Is this needed for handling
Task Manager failures or JM or both? Will Yarn not take care of JM failures?

It may sound like I am little confused between role of Yarn and Flink
components– who has the most burden of HA? Documentation in current state is
lacking clarity – I know it is still evolving.

Please let me know if somebody can help clear the confusion.

Thanks
Ankit

High Availability on Yarn

Reply via email to