The docs have been updated.
On Thu, Nov 19, 2015 at 12:36 PM, Ufuk Celebi <u...@apache.org> wrote:

> I’ve added a note about this to the docs and asked Max to trigger a new build of them.
>
> Regarding Aljoscha’s idea: I like it. It is essentially a shortcut for configuring the root path.
>
> In any case, it is orthogonal to Till’s proposals. That one we need to address as well (see FLINK-2929). The motivation for the current behaviour was to be rather defensive when removing state, in order to not lose data accidentally. But it can be confusing, indeed.
>
> – Ufuk
>
>> On 19 Nov 2015, at 12:08, Till Rohrmann <trohrm...@apache.org> wrote:
>>
>> You mean an additional start-up parameter for the `start-cluster.sh` script for the HA case? That could work.
>>
>> On Thu, Nov 19, 2015 at 11:54 AM, Aljoscha Krettek <aljos...@apache.org> wrote:
>> Maybe we could add a user parameter to specify a cluster name that is used to make the paths unique.
>>
>> On Thu, Nov 19, 2015, 11:24 Till Rohrmann <trohrm...@apache.org> wrote:
>> I agree that this would make the configuration easier. However, it also entails that the user has to retrieve the randomized path from the logs if he wants to restart jobs after the cluster has crashed or been intentionally restarted. Furthermore, the system won’t be able to clean up old checkpoint and job handles in case the cluster stop was intentional.
>>
>> Thus, the question is how we define the behaviour for retrieving handles and cleaning up old handles so that ZooKeeper won’t be cluttered with stale entries.
>>
>> There are basically two modes:
>>
>> 1. Keep state handles when shutting down the cluster. Provide a means to define a fixed path when starting the cluster, and also a means to purge old state handles. Furthermore, add a shutdown mode where the handles under the current path are directly removed. This mode would guarantee that the state handles are always available unless explicitly told otherwise. However, the downside is that ZooKeeper will almost certainly become cluttered.
>>
>> 2. Remove the state handles when shutting down the cluster. Provide a shutdown mode where we keep the state handles. This will keep ZooKeeper clean but still give you the possibility to keep a checkpoint around if necessary. However, the user is more likely to lose his state when shutting down the cluster.
>>
>> On Thu, Nov 19, 2015 at 10:55 AM, Robert Metzger <rmetz...@apache.org> wrote:
>> I agree with Aljoscha. Many companies install Flink (and its config) in a central directory and users share that installation.
>>
>> On Thu, Nov 19, 2015 at 10:45 AM, Aljoscha Krettek <aljos...@apache.org> wrote:
>> I think we should find a way to randomize the paths where the HA stuff stores data. If users don’t realize that they store data in the same paths, this could lead to problems.
>>
>> > On 19 Nov 2015, at 08:50, Till Rohrmann <trohrm...@apache.org> wrote:
>> >
>> > Hi Gwenhaël,
>> >
>> > good to hear that you could resolve the problem.
>> >
>> > When you run multiple HA Flink jobs in the same cluster, then you don’t have to adjust the configuration of Flink. It should work out of the box.
>> >
>> > However, if you run multiple HA Flink clusters, then you have to set a distinct ZooKeeper root path for each cluster via the option recovery.zookeeper.path.root in the Flink configuration. This is necessary because otherwise all JobManagers (the ones of the different clusters) will compete for a single leadership. Furthermore, all TaskManagers will only see the one and only leader and connect to it. The reason is that the TaskManagers look up their leader at a ZNode below the ZooKeeper root path.
>> >
>> > If you have other questions, then don’t hesitate to ask me.
>> >
>> > Cheers,
>> > Till
>> >
>> > On Wed, Nov 18, 2015 at 6:37 PM, Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com> wrote:
>> > Never mind,
>> >
>> > Looking at the logs I saw that it was having issues trying to connect to ZK.
>> >
>> > To make it short: it had the wrong port.
>> >
>> > It is now starting.
>> >
>> > Tomorrow I’ll try to kill some JobManagers *evil*.
>> >
>> > Another question: if I have multiple HA Flink jobs, are there some points to check in order to be sure that they won’t collide on HDFS or ZK?
>> >
>> > B.R.
>> >
>> > Gwenhaël PASQUIERS
>> >
>> > From: Till Rohrmann [mailto:till.rohrm...@gmail.com]
>> > Sent: Wednesday, November 18, 2015 18:01
>> > To: user@flink.apache.org
>> > Subject: Re: YARN High Availability
>> >
>> > Hi Gwenhaël,
>> >
>> > do you have access to the YARN logs?
>> >
>> > Cheers,
>> > Till
>> >
>> > On Wed, Nov 18, 2015 at 5:55 PM, Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com> wrote:
>> > Hello,
>> >
>> > We’re trying to set up high availability using an existing ZooKeeper quorum already running in our Cloudera cluster.
>> >
>> > So, as per the docs, we’ve changed the max attempts in YARN’s config as well as the flink.yaml:
>> >
>> > recovery.mode: zookeeper
>> > recovery.zookeeper.quorum: host1:3181,host2:3181,host3:3181
>> > state.backend: filesystem
>> > state.backend.fs.checkpointdir: hdfs:///flink/checkpoints
>> > recovery.zookeeper.storageDir: hdfs:///flink/recovery/
>> > yarn.application-attempts: 1000
>> >
>> > Everything is OK as long as recovery.mode is commented out. As soon as I uncomment recovery.mode, the deployment on YARN is stuck on:
>> >
>> > “Deploying cluster, current state ACCEPTED”
>> > “Deployment took more than 60 seconds…”
>> >
>> > every second.
>> >
>> > And I have more than enough resources available on my YARN cluster.
>> >
>> > Do you have any idea of what could cause this, and/or what logs I should look for in order to understand?
>> >
>> > B.R.
>> >
>> > Gwenhaël PASQUIERS
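To make the per-cluster setup Till describes above concrete, here is a minimal sketch combining Gwenhaël's configuration with a distinct recovery.zookeeper.path.root per HA cluster. The root path values (/flink/cluster-a, /flink/cluster-b) are only illustrative names, not prescribed values; any two distinct ZNode paths will do:

```yaml
# flink.yaml for HA cluster A
# (quorum/host names taken from the thread; root path is an example)
recovery.mode: zookeeper
recovery.zookeeper.quorum: host1:3181,host2:3181,host3:3181
recovery.zookeeper.path.root: /flink/cluster-a
recovery.zookeeper.storageDir: hdfs:///flink/recovery/
state.backend: filesystem
state.backend.fs.checkpointdir: hdfs:///flink/checkpoints
yarn.application-attempts: 1000

# A second HA cluster would use an identical config except for the root path,
# e.g.:
#   recovery.zookeeper.path.root: /flink/cluster-b
```

With distinct root paths, each cluster's JobManagers run leader election under their own ZNode, so the TaskManagers of one cluster never look up and connect to the leader of another.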