Re: High Availability on Yarn

Robert Metzger Wed, 24 May 2017 06:44:00 -0700

Hi Ankit,

I'm sorry that nobody is responding to the message. I'll try to find
somebody.


On Tue, May 23, 2017 at 10:27 PM, Jain, Ankit <ankit.j...@here.com> wrote:

> Following up on this.
>
>
>
> *From: *"Jain, Ankit" <ankit.j...@here.com>
> *Date: *Tuesday, May 16, 2017 at 12:14 AM
>
> *To: *Stephan Ewen <se...@apache.org>, "user@flink.apache.org" <
> user@flink.apache.org>
> *Subject: *Re: High Availability on Yarn
>
>
>
> Bringing it back to list’s focus.
>
>
>
> *From: *"Jain, Ankit" <ankit.j...@here.com>
> *Date: *Thursday, May 11, 2017 at 1:19 PM
> *To: *Stephan Ewen <se...@apache.org>, "user@flink.apache.org" <
> user@flink.apache.org>
> *Subject: *Re: High Availability on Yarn
>
>
>
> Got the answer on #2, looks like that will work, still looking for
> suggestions on #1.
>
>
>
> Thanks
>
> Ankit
>
>
>
> *From: *"Jain, Ankit" <ankit.j...@here.com>
> *Date: *Thursday, May 11, 2017 at 8:26 AM
> *To: *Stephan Ewen <se...@apache.org>, "user@flink.apache.org" <
> user@flink.apache.org>
> *Subject: *Re: High Availability on Yarn
>
>
>
> Following up further on this.
>
>
>
> 1)      We are using a long running EMR cluster to submit jobs right now
> and as you know EMR hasn’t made Yarn ResourceManager HA.
>
> Is there any way we can use the information put in Zookeeper by Flink Job
> Manager to bring the jobs back up on a new EMR cluster if RM goes down?
>
>
>
> We are not looking for completely automated option but maybe write a
> script which reads Zookeeper and re-starts all jobs on a fresh EMR cluster?
>
> I am assuming if  Yarn ResouceManager goes down, there is no way to just
> bring it back up – you have to start a new EMR cluster?
>
>
>
> 2)      Regarding elasticity, I know for now a running flink cluster
> can’t make use of new hosts added to EMR but can I am guessing Yarn will
> still see the new hosts and new flink jobs can make use it, is that right?
>
>
>
>
>
> Thanks
>
> Ankit
>
>
>
> *From: *"Jain, Ankit" <ankit.j...@here.com>
> *Date: *Monday, May 8, 2017 at 9:09 AM
> *To: *Stephan Ewen <se...@apache.org>, "user@flink.apache.org" <
> user@flink.apache.org>
> *Subject: *Re: High Availability on Yarn
>
>
>
> Thanks Stephan – we will go with a central ZooKeeper Instance and
> hopefully have it started through a cloudformation script as part of EMR
> startup.
>
>
>
> Is Zk also used to keep track of checkpoint metadata and the execution
> graph of the running job to recover from ApplicationMaster failure as
> Aljoscha was guessing below or only for leader election in case of
> accidently running multiple Application Masters ?
>
>
>
> Thanks
>
> Ankit
>
>
>
> *From: *Stephan Ewen <se...@apache.org>
> *Date: *Monday, May 8, 2017 at 9:00 AM
> *To: *"user@flink.apache.org" <user@flink.apache.org>, "Jain, Ankit" <
> ankit.j...@here.com>
> *Subject: *Re: High Availability on Yarn
>
>
>
> @Ankit:
>
>
>
> ZooKeeper is required in YARN setups still. Even if there is only one
> JobManager in the normal case, Yarn can accidentally create a second one
> when there is a network partition.
>
> To prevent that this leads to inconsistencies, we use ZooKeeper.
>
>
>
> Flink uses ZooKeeper very little, so you can just let Flink attach to any
> existing ZooKeeper, or user one ZooKeeper cluster for very many Flink
> clusters/jobs.
>
>
>
> Stephan
>
>
>
>
>
> On Mon, May 8, 2017 at 2:11 PM, Aljoscha Krettek <aljos...@apache.org>
> wrote:
>
> Hi,
>
> Yes, it’s recommended to use one ZooKeeper cluster for all Flink clusters.
>
>
>
> Best,
>
> Aljoscha
>
>
>
> On 5. May 2017, at 16:56, Jain, Ankit <ankit.j...@here.com> wrote:
>
>
>
> Thanks for the update Aljoscha.
>
>
>
> @Till Rohrmann <trohrm...@apache.org>,
>
> Can you please chim in?
>
>
>
> Also, we currently have a long running EMR cluster where we create one
> flink cluster per job – can we just choose to install Zookeeper when
> creating the EMR cluster and use one Zookeeper instance for ALL of flink
> jobs?
>
> Or
>
> Recommendation is to have a dedicated Zookeeper instance per flink job?
>
>
>
> Thanks
>
> Ankit
>
>
>
> *From: *Aljoscha Krettek <aljos...@apache.org>
> *Date: *Thursday, May 4, 2017 at 1:19 AM
> *To: *"Jain, Ankit" <ankit.j...@here.com>
> *Cc: *"user@flink.apache.org" <user@flink.apache.org>, Till Rohrmann <
> trohrm...@apache.org>
> *Subject: *Re: High Availability on Yarn
>
>
>
> Hi,
>
> Yes, for YARN there is only one running JobManager. As far as I Know, In
> this case ZooKeeper is only used to keep track of checkpoint metadata and
> the execution graph of the running job. Such that a restoring JobManager
> can pick up the data again.
>
>
>
> I’m not 100 % sure on this, though, so maybe Till can shed some light on
> this.
>
>
>
> Best,
>
> Aljoscha
>
> On 3. May 2017, at 16:58, Jain, Ankit <ankit.j...@here.com> wrote:
>
>
>
> Thanks for your reply Aljoscha.
>
>
>
> After building better understanding of Yarn and spending copious amount of
> time on Flink codebase, I think I now get how Flink & Yarn interact – I
> plan to document this soon in case it could help somebody starting afresh
> with Flink-Yarn.
>
>
>
> Regarding Zookeper, in YARN mode there is only one JobManager running, do
> we still need leader election?
>
>
>
> If the ApplicationMaster goes down (where JM runs) it is restarted by Yarn
> RM and while restarting, Flink AM will bring back previous running
> containers.  So, where does Zookeeper sit in this setup?
>
>
>
> Thanks
>
> Ankit
>
>
>
> *From: *Aljoscha Krettek <aljos...@apache.org>
> *Date: *Wednesday, May 3, 2017 at 2:05 AM
> *To: *"Jain, Ankit" <ankit.j...@here.com>
> *Cc: *"user@flink.apache.org" <user@flink.apache.org>, Till Rohrmann <
> trohrm...@apache.org>
> *Subject: *Re: High Availability on Yarn
>
>
>
> Hi,
>
> As a first comment, the work mentioned in the FLIP-6 doc you linked is
> still work-in-progress. You cannot use these abstractions yet without going
> into the code and setting up a cluster “by hand”.
>
>
>
> The documentation for one-step deployment of a Job to YARN is available
> here: https://ci.apache.org/projects/flink/flink-docs-
> release-1.2/setup/yarn_setup.html#run-a-single-flink-job-on-yarn
> <https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fci.apache.org%2Fprojects%2Fflink%2Fflink-docs-release-1.2%2Fsetup%2Fyarn_setup.html%23run-a-single-flink-job-on-yarn&data=01%7C01%7C%7Cfb13823970ba4476ebbf08d49203846c%7C6d4034cd72254f72b85391feaea64919%7C1&sdata=fzBiQtv7MR2%2Fehg6GepwPa1uWxpqEgPJakto2B8k0Zk%3D&reserved=0>
>
>
>
> Regarding your third question, ZooKeeper is mostly used for discovery and
> leader election. That is, JobManagers use it to decide who is the main JM
> and who are standby JMs. TaskManagers use it to discover the leading
> JobManager that they should connect to.
>
>
>
> I’m also cc’ing Till, who should know this stuff better and can maybe
> explain it in a bit more detail.
>
>
>
> Best,
>
> Aljoscha
>
> On 1. May 2017, at 18:59, Jain, Ankit <ankit.j...@here.com> wrote:
>
>
>
> Hi fellow users,
>
> We are trying to straighten out high availability story for flink.
>
>
>
> Our setup includes a long running EMR cluster, job submission is a
> two-step process – 1) Flink cluster is first created using flink yarn
> client on the EMR cluster already running 2) Flink job is submitted.
>
>
>
> I also saw references that with 1.2, these two steps have been combined
> into 1 – is that change in FlinkYarnSessionCli.java? Can somebody point to
> documentation please?
>
>
>
> W/o worrying about Yarn RM (not Flink Yarn RM that seems to be newly
> introduced) failure for now, I want to understand first how task manager &
> job manager failures are handled.
>
>
>
> My questions-
>
> 1)       https://cwiki.apache.org/confluence/pages/viewpage.
> action?pageId=65147077
> <https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fpages%2Fviewpage.action%3FpageId%3D65147077&data=01%7C01%7C%7Cfb13823970ba4476ebbf08d49203846c%7C6d4034cd72254f72b85391feaea64919%7C1&sdata=29Se7mWQZ09ukF3rkQNmSRPXY4RkA8RCNO4ec4Glj8I%3D&reserved=0>
>  suggests a new RM has been added and now there is one JobManager for
> each job. Since Yarn RM will now talk to Flink RM( instead of JobManager
> previously), will Yarn automatically restart failing Flink RM?
>
> 2)       Is there any documentation on behavior of new Flink RM that will
> come up? How will previously running JobManagers & TaskManagers find out
> about new RM?
>
> 3)       https://ci.apache.org/projects/flink/flink-docs-
> release-1.3/setup/jobmanager_high_availability.html#configuration
> <https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fci.apache.org%2Fprojects%2Fflink%2Fflink-docs-release-1.3%2Fsetup%2Fjobmanager_high_availability.html%23configuration&data=01%7C01%7C%7Cfb13823970ba4476ebbf08d49203846c%7C6d4034cd72254f72b85391feaea64919%7C1&sdata=nYTYWaaWA4T1D7EwvL%2B7mwhrVcqn6xTzCv8SS6x%2FqLM%3D&reserved=0>
>  requires configuring Zookeeper even for Yarn – Is this needed for
> handling Task Manager failures or JM or both? Will Yarn not take care of JM
> failures?
>
>
>
> It may sound like I am little confused between role of Yarn and Flink
> components– who has the most burden of HA? Documentation in current state
> is lacking clarity – I know it is still evolving.
>
>
>
> Please let me know if somebody can help clear the confusion.
>
>
>
> Thanks
>
> Ankit
>
>
>
>
>
>
>
>
>
>
>
>
>

Re: High Availability on Yarn

Reply via email to