Hi Ankit, I'm sorry that nobody is responding to the message. I'll try to find somebody.

On Tue, May 23, 2017 at 10:27 PM, Jain, Ankit <ankit.j...@here.com> wrote:

> Following up on this.
>
> *From: *"Jain, Ankit" <ankit.j...@here.com>
> *Date: *Tuesday, May 16, 2017 at 12:14 AM
> *To: *Stephan Ewen <se...@apache.org>, "user@flink.apache.org" <user@flink.apache.org>
> *Subject: *Re: High Availability on Yarn
>
> Bringing it back to list’s focus.
>
> *From: *"Jain, Ankit" <ankit.j...@here.com>
> *Date: *Thursday, May 11, 2017 at 1:19 PM
> *To: *Stephan Ewen <se...@apache.org>, "user@flink.apache.org" <user@flink.apache.org>
> *Subject: *Re: High Availability on Yarn
>
> Got the answer on #2, looks like that will work, still looking for
> suggestions on #1.
>
> Thanks
> Ankit
>
> *From: *"Jain, Ankit" <ankit.j...@here.com>
> *Date: *Thursday, May 11, 2017 at 8:26 AM
> *To: *Stephan Ewen <se...@apache.org>, "user@flink.apache.org" <user@flink.apache.org>
> *Subject: *Re: High Availability on Yarn
>
> Following up further on this.
>
> 1) We are using a long running EMR cluster to submit jobs right now
> and as you know EMR hasn’t made Yarn ResourceManager HA.
>
> Is there any way we can use the information put in Zookeeper by Flink
> Job Manager to bring the jobs back up on a new EMR cluster if RM goes
> down?
>
> We are not looking for a completely automated option but maybe write a
> script which reads Zookeeper and re-starts all jobs on a fresh EMR
> cluster?
>
> I am assuming if Yarn ResourceManager goes down, there is no way to just
> bring it back up – you have to start a new EMR cluster?
>
> 2) Regarding elasticity, I know for now a running flink cluster can’t
> make use of new hosts added to EMR but I am guessing Yarn will still see
> the new hosts and new flink jobs can make use of them, is that right?
>
> Thanks
> Ankit
>
> *From: *"Jain, Ankit" <ankit.j...@here.com>
> *Date: *Monday, May 8, 2017 at 9:09 AM
> *To: *Stephan Ewen <se...@apache.org>, "user@flink.apache.org" <user@flink.apache.org>
> *Subject: *Re: High Availability on Yarn
>
> Thanks Stephan – we will go with a central ZooKeeper instance and
> hopefully have it started through a cloudformation script as part of EMR
> startup.
>
> Is Zk also used to keep track of checkpoint metadata and the execution
> graph of the running job to recover from ApplicationMaster failure, as
> Aljoscha was guessing below, or only for leader election in case of
> accidentally running multiple Application Masters?
>
> Thanks
> Ankit
>
> *From: *Stephan Ewen <se...@apache.org>
> *Date: *Monday, May 8, 2017 at 9:00 AM
> *To: *"user@flink.apache.org" <user@flink.apache.org>, "Jain, Ankit" <ankit.j...@here.com>
> *Subject: *Re: High Availability on Yarn
>
> @Ankit:
>
> ZooKeeper is required in YARN setups still. Even if there is only one
> JobManager in the normal case, Yarn can accidentally create a second one
> when there is a network partition.
>
> To prevent this from leading to inconsistencies, we use ZooKeeper.
>
> Flink uses ZooKeeper very little, so you can just let Flink attach to any
> existing ZooKeeper, or use one ZooKeeper cluster for very many Flink
> clusters/jobs.
>
> Stephan
>
> On Mon, May 8, 2017 at 2:11 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
>
> Hi,
>
> Yes, it’s recommended to use one ZooKeeper cluster for all Flink clusters.
>
> Best,
> Aljoscha
>
> On 5. May 2017, at 16:56, Jain, Ankit <ankit.j...@here.com> wrote:
>
> Thanks for the update Aljoscha.
>
> @Till Rohrmann <trohrm...@apache.org>,
>
> Can you please chime in?
>
> Also, we currently have a long running EMR cluster where we create one
> flink cluster per job – can we just choose to install Zookeeper when
> creating the EMR cluster and use one Zookeeper instance for ALL of the
> flink jobs?
>
> Or is the recommendation to have a dedicated Zookeeper instance per
> flink job?
>
> Thanks
> Ankit
>
> *From: *Aljoscha Krettek <aljos...@apache.org>
> *Date: *Thursday, May 4, 2017 at 1:19 AM
> *To: *"Jain, Ankit" <ankit.j...@here.com>
> *Cc: *"user@flink.apache.org" <user@flink.apache.org>, Till Rohrmann <trohrm...@apache.org>
> *Subject: *Re: High Availability on Yarn
>
> Hi,
>
> Yes, for YARN there is only one running JobManager. As far as I know, in
> this case ZooKeeper is only used to keep track of checkpoint metadata and
> the execution graph of the running job, so that a restoring JobManager
> can pick up the data again.
>
> I’m not 100 % sure on this, though, so maybe Till can shed some light on
> this.
>
> Best,
> Aljoscha
>
> On 3. May 2017, at 16:58, Jain, Ankit <ankit.j...@here.com> wrote:
>
> Thanks for your reply Aljoscha.
>
> After building a better understanding of Yarn and spending a copious
> amount of time on the Flink codebase, I think I now get how Flink & Yarn
> interact – I plan to document this soon in case it could help somebody
> starting afresh with Flink-Yarn.
>
> Regarding ZooKeeper, in YARN mode there is only one JobManager running,
> do we still need leader election?
>
> If the ApplicationMaster goes down (where JM runs) it is restarted by
> Yarn RM and while restarting, Flink AM will bring back previous running
> containers. So, where does Zookeeper sit in this setup?
>
> Thanks
> Ankit
>
> *From: *Aljoscha Krettek <aljos...@apache.org>
> *Date: *Wednesday, May 3, 2017 at 2:05 AM
> *To: *"Jain, Ankit" <ankit.j...@here.com>
> *Cc: *"user@flink.apache.org" <user@flink.apache.org>, Till Rohrmann <trohrm...@apache.org>
> *Subject: *Re: High Availability on Yarn
>
> Hi,
>
> As a first comment, the work mentioned in the FLIP-6 doc you linked is
> still work-in-progress. You cannot use these abstractions yet without
> going into the code and setting up a cluster “by hand”.
>
> The documentation for one-step deployment of a Job to YARN is available
> here:
> https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/yarn_setup.html#run-a-single-flink-job-on-yarn
>
> Regarding your third question, ZooKeeper is mostly used for discovery and
> leader election. That is, JobManagers use it to decide who is the main JM
> and who are standby JMs. TaskManagers use it to discover the leading
> JobManager that they should connect to.
>
> I’m also cc’ing Till, who should know this stuff better and can maybe
> explain it in a bit more detail.
>
> Best,
> Aljoscha
>
> On 1. May 2017, at 18:59, Jain, Ankit <ankit.j...@here.com> wrote:
>
> Hi fellow users,
>
> We are trying to straighten out high availability story for flink.
>
> Our setup includes a long running EMR cluster, job submission is a
> two-step process – 1) Flink cluster is first created using flink yarn
> client on the EMR cluster already running 2) Flink job is submitted.
>
> I also saw references that with 1.2, these two steps have been combined
> into 1 – is that change in FlinkYarnSessionCli.java? Can somebody point
> to documentation please?
>
> W/o worrying about Yarn RM (not Flink Yarn RM that seems to be newly
> introduced) failure for now, I want to understand first how task manager
> & job manager failures are handled.
>
> My questions:
>
> 1) https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077
> suggests a new RM has been added and now there is one JobManager for
> each job. Since Yarn RM will now talk to Flink RM (instead of JobManager
> previously), will Yarn automatically restart a failing Flink RM?
>
> 2) Is there any documentation on the behavior of the new Flink RM that
> will come up? How will previously running JobManagers & TaskManagers find
> out about the new RM?
>
> 3) https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/jobmanager_high_availability.html#configuration
> requires configuring Zookeeper even for Yarn – is this needed for
> handling Task Manager failures or JM or both? Will Yarn not take care of
> JM failures?
>
> It may sound like I am a little confused between the roles of Yarn and
> Flink components – who has the most burden of HA? Documentation in its
> current state is lacking clarity – I know it is still evolving.
>
> Please let me know if somebody can help clear the confusion.
>
> Thanks
> Ankit
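
For readers landing on this thread later: the "one shared ZooKeeper for all Flink clusters on YARN" setup that Stephan and Aljoscha recommend comes down to a few lines in flink-conf.yaml. Below is a minimal sketch in Python that renders those lines. The key names are the ones I recall from the release-1.3 HA documentation linked in the thread and may differ in other Flink versions, so please check them against your release. The host names, the S3 bucket, and the helper name `flink_ha_config` are hypothetical and used only for illustration.

```python
def flink_ha_config(zk_quorum, storage_dir, zk_root="/flink", app_attempts=10):
    """Render flink-conf.yaml lines for ZooKeeper-backed JobManager HA on YARN.

    zk_quorum    -- list of "host:port" strings for the shared ZooKeeper
    storage_dir  -- durable directory for JobManager recovery metadata
    zk_root      -- ZooKeeper root node under which Flink keeps its data
    app_attempts -- how often YARN may restart the ApplicationMaster
                    (YARN's own maximum attempts setting still caps this)

    Key names are assumed from the Flink 1.3 docs; verify for your version.
    """
    if not zk_quorum:
        raise ValueError("at least one ZooKeeper host:port is required")
    entries = [
        ("high-availability", "zookeeper"),
        ("high-availability.zookeeper.quorum", ",".join(zk_quorum)),
        ("high-availability.zookeeper.path.root", zk_root),
        ("high-availability.zookeeper.storageDir", storage_dir),
        ("yarn.application-attempts", str(app_attempts)),
    ]
    return "\n".join(f"{key}: {value}" for key, value in entries)


if __name__ == "__main__":
    # Hypothetical hosts and bucket, for illustration only.
    print(flink_ha_config(
        ["zk1.example.internal:2181", "zk2.example.internal:2181"],
        "s3://example-bucket/flink/recovery",
    ))
```

The sketch deliberately sets no per-cluster ID. As far as I know, on YARN Flink derives one from the application ID, which is what lets many Flink clusters share a single ZooKeeper. If the goal is to survive losing the whole EMR cluster, the storage directory presumably has to live outside that cluster (for example in S3 rather than the cluster's HDFS), but that is an assumption on my part and not something confirmed in the thread.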