The docs have been updated.
On Thu, Nov 19, 2015 at 12:36 PM, Ufuk Celebi <u...@apache.org> wrote:

> I’ve added a note about this to the docs and asked Max to trigger a new build of them.
>
> Regarding Aljoscha’s idea: I like it. It is essentially a shortcut for configuring the root path.
>
> In any case, it is orthogonal to Till’s proposals. That one we need to address as well (see FLINK-2929). The motivation for the current behaviour was to be rather defensive when removing state, in order to not lose data accidentally. But it can be confusing, indeed.
>
> – Ufuk
>
>> On 19 Nov 2015, at 12:08, Till Rohrmann <trohrm...@apache.org> wrote:
>>
>> You mean an additional start-up parameter for the `start-cluster.sh` script for the HA case? That could work.
>>
>> On Thu, Nov 19, 2015 at 11:54 AM, Aljoscha Krettek <aljos...@apache.org> wrote:
>> Maybe we could add a user parameter to specify a cluster name that is used to make the paths unique.
>>
>> On Thu, Nov 19, 2015, 11:24 Till Rohrmann <trohrm...@apache.org> wrote:
>> I agree that this would make the configuration easier. However, it also entails that the user has to retrieve the randomized path from the logs if he wants to restart jobs after the cluster has crashed or been intentionally restarted. Furthermore, the system won’t be able to clean up old checkpoint and job handles in case the cluster stop was intentional.
>>
>> Thus, the question is how we define the behaviour for retrieving handles and cleaning up old handles so that ZooKeeper won’t be cluttered with stale entries.
>>
>> There are basically two modes:
>>
>> 1. Keep state handles when shutting down the cluster. Provide a means to define a fixed path when starting the cluster, and also a means to purge old state handles. Furthermore, add a shutdown mode where the handles under the current path are directly removed. This mode would guarantee that the state handles are always available unless explicitly told otherwise. However, the downside is that ZooKeeper will almost certainly become cluttered.
>>
>> 2. Remove the state handles when shutting down the cluster. Provide a shutdown mode where we keep the state handles. This will keep ZooKeeper clean but still give you the possibility to keep a checkpoint around if necessary. However, the user is more likely to lose his state when shutting down the cluster.
>>
>> On Thu, Nov 19, 2015 at 10:55 AM, Robert Metzger <rmetz...@apache.org> wrote:
>> I agree with Aljoscha. Many companies install Flink (and its config) in a central directory and users share that installation.
>>
>> On Thu, Nov 19, 2015 at 10:45 AM, Aljoscha Krettek <aljos...@apache.org> wrote:
>> I think we should find a way to randomize the paths where the HA stuff stores data. If users don’t realize that they store data in the same paths, this could lead to problems.
>>
>> > On 19 Nov 2015, at 08:50, Till Rohrmann <trohrm...@apache.org> wrote:
>> >
>> > Hi Gwenhaël,
>> >
>> > good to hear that you could resolve the problem.
>> >
>> > When you run multiple HA Flink jobs in the same cluster, then you don’t have to adjust the configuration of Flink. It should work out of the box.
>> >
>> > However, if you run multiple HA Flink clusters, then you have to set a distinct ZooKeeper root path for each cluster via the option recovery.zookeeper.path.root in the Flink configuration. This is necessary because otherwise all JobManagers (the ones of the different clusters) will compete for a single leadership. Furthermore, all TaskManagers will only see the one and only leader and connect to it. The reason is that the TaskManagers look up their leader at a ZNode below the ZooKeeper root path.
>> >
>> > If you have other questions, then don’t hesitate to ask me.
>> >
>> > Cheers,
>> > Till
>> >
>> > On Wed, Nov 18, 2015 at 6:37 PM, Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com> wrote:
>> > Never mind,
>> >
>> > Looking at the logs I saw that it was having issues trying to connect to ZK.
>> >
>> > To make it short: it had the wrong port.
>> >
>> > It is now starting.
>> >
>> > Tomorrow I’ll try to kill some JobManagers *evil*.
>> >
>> > Another question: if I have multiple HA Flink jobs, are there some points to check in order to be sure that they won’t collide on HDFS or ZK?
>> >
>> > B.R.
>> >
>> > Gwenhaël PASQUIERS
>> >
>> > From: Till Rohrmann [mailto:till.rohrm...@gmail.com]
>> > Sent: Wednesday, November 18, 2015 18:01
>> > To: user@flink.apache.org
>> > Subject: Re: YARN High Availability
>> >
>> > Hi Gwenhaël,
>> >
>> > do you have access to the YARN logs?
>> >
>> > Cheers,
>> > Till
>> >
>> > On Wed, Nov 18, 2015 at 5:55 PM, Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com> wrote:
>> > Hello,
>> >
>> > We’re trying to set up high availability using an existing ZooKeeper quorum already running in our Cloudera cluster.
>> >
>> > So, as per the docs, we’ve changed the max attempts in YARN’s config as well as the flink.yaml:
>> >
>> > recovery.mode: zookeeper
>> > recovery.zookeeper.quorum: host1:3181,host2:3181,host3:3181
>> > state.backend: filesystem
>> > state.backend.fs.checkpointdir: hdfs:///flink/checkpoints
>> > recovery.zookeeper.storageDir: hdfs:///flink/recovery/
>> > yarn.application-attempts: 1000
>> >
>> > Everything is OK as long as recovery.mode is commented out. As soon as I uncomment recovery.mode, the deployment on YARN is stuck on:
>> >
>> > “Deploying cluster, current state ACCEPTED”
>> > “Deployment took more than 60 seconds…”
>> >
>> > every second.
>> >
>> > And I have more than enough resources available on my YARN cluster.
>> >
>> > Do you have any idea of what could cause this, and/or what logs I should look for in order to understand?
>> >
>> > B.R.
>> >
>> > Gwenhaël PASQUIERS
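To make the per-cluster setup Till describes above concrete, here is a minimal sketch combining Gwenhaël's configuration with a distinct recovery.zookeeper.path.root per HA cluster. The root path values (/flink/cluster-a, /flink/cluster-b) are only illustrative names, not prescribed values; any two distinct ZNode paths will do:

```yaml
# flink.yaml for HA cluster A
# (quorum/host names taken from the thread; root path is an example)
recovery.mode: zookeeper
recovery.zookeeper.quorum: host1:3181,host2:3181,host3:3181
recovery.zookeeper.path.root: /flink/cluster-a
recovery.zookeeper.storageDir: hdfs:///flink/recovery/
state.backend: filesystem
state.backend.fs.checkpointdir: hdfs:///flink/checkpoints
yarn.application-attempts: 1000

# A second HA cluster would use an identical config except for the root path,
# e.g.:
#   recovery.zookeeper.path.root: /flink/cluster-b
```

With distinct root paths, each cluster's JobManagers run leader election under their own ZNode, so the TaskManagers of one cluster never look up and connect to the leader of another.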