You mean an additional start-up parameter for the `start-cluster.sh` script for the HA case? That could work.
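Something along these lines, I suppose (just a sketch of the idea — the --cluster-name flag is hypothetical and doesn't exist today; under the hood it would only need to set recovery.zookeeper.path.root, the option mentioned further down in the thread):

    # hypothetical invocation; the flag is only a proposal, not an existing option
    ./bin/start-cluster.sh --cluster-name my-ha-cluster

    # which would effectively be the same as setting, before start-up:
    recovery.zookeeper.path.root: /flink/my-ha-cluster   # illustrative path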
On Thu, Nov 19, 2015 at 11:54 AM, Aljoscha Krettek <aljos...@apache.org> wrote:

> Maybe we could add a user parameter to specify a cluster name that is used
> to make the paths unique.
>
> On Thu, Nov 19, 2015, 11:24 Till Rohrmann <trohrm...@apache.org> wrote:
>
>> I agree that this would make the configuration easier. However, it also
>> entails that the user has to retrieve the randomized path from the logs
>> if he wants to restart jobs after the cluster has crashed or was
>> intentionally restarted. Furthermore, the system won't be able to clean
>> up old checkpoint and job handles in case the cluster stop was
>> intentional.
>>
>> Thus, the question is: how do we define the behaviour for retrieving
>> handles and for cleaning up old handles so that ZooKeeper won't be
>> cluttered with old handles?
>>
>> There are basically two modes:
>>
>> 1. Keep state handles when shutting down the cluster. Provide a means to
>> define a fixed path when starting the cluster and also a means to purge
>> old state handles. Furthermore, add a shutdown mode where the handles
>> under the current path are removed directly. This mode would guarantee
>> that the state handles are always available unless explicitly told
>> otherwise. However, the downside is that ZooKeeper will almost certainly
>> become cluttered.
>>
>> 2. Remove the state handles when shutting down the cluster. Provide a
>> shutdown mode where we keep the state handles. This will keep ZooKeeper
>> clean but still gives you the possibility to keep a checkpoint around if
>> necessary. However, the user is more likely to lose his state when
>> shutting down the cluster.
>>
>> On Thu, Nov 19, 2015 at 10:55 AM, Robert Metzger <rmetz...@apache.org> wrote:
>>
>>> I agree with Aljoscha. Many companies install Flink (and its config) in
>>> a central directory and users share that installation.
>>>
>>> On Thu, Nov 19, 2015 at 10:45 AM, Aljoscha Krettek <aljos...@apache.org> wrote:
>>>
>>>> I think we should find a way to randomize the paths where the HA stuff
>>>> stores data. If users don't realize that they store data in the same
>>>> paths, this could lead to problems.
>>>>
>>>> > On 19 Nov 2015, at 08:50, Till Rohrmann <trohrm...@apache.org> wrote:
>>>> >
>>>> > Hi Gwenhaël,
>>>> >
>>>> > good to hear that you could resolve the problem.
>>>> >
>>>> > When you run multiple HA Flink jobs in the same cluster, you don't
>>>> > have to adjust the configuration of Flink. It should work out of
>>>> > the box.
>>>> >
>>>> > However, if you run multiple HA Flink clusters, then you have to set
>>>> > a distinct ZooKeeper root path for each cluster via the option
>>>> > recovery.zookeeper.path.root in the Flink configuration. This is
>>>> > necessary because otherwise all JobManagers (the ones of the
>>>> > different clusters) will compete for a single leadership.
>>>> > Furthermore, all TaskManagers will only see the one and only leader
>>>> > and connect to it. The reason is that the TaskManagers look up their
>>>> > leader at a ZNode below the ZooKeeper root path.
>>>> >
>>>> > If you have other questions, don't hesitate to ask me.
>>>> >
>>>> > Cheers,
>>>> > Till
>>>> >
>>>> > On Wed, Nov 18, 2015 at 6:37 PM, Gwenhael Pasquiers
>>>> > <gwenhael.pasqui...@ericsson.com> wrote:
>>>> >
>>>> > Nevermind,
>>>> >
>>>> > Looking at the logs I saw that it was having issues trying to
>>>> > connect to ZK.
>>>> >
>>>> > To make it short, it had the wrong port.
>>>> >
>>>> > It is now starting.
>>>> >
>>>> > Tomorrow I'll try to kill some JobManagers *evil*.
>>>> >
>>>> > Another question: if I have multiple HA Flink jobs, are there some
>>>> > points to check in order to be sure that they won't collide on HDFS
>>>> > or ZK?
>>>> >
>>>> > B.R.
>>>> >
>>>> > Gwenhaël PASQUIERS
>>>> >
>>>> > From: Till Rohrmann [mailto:till.rohrm...@gmail.com]
>>>> > Sent: Wednesday, 18 November 2015 18:01
>>>> > To: user@flink.apache.org
>>>> > Subject: Re: YARN High Availability
>>>> >
>>>> > Hi Gwenhaël,
>>>> >
>>>> > do you have access to the YARN logs?
>>>> >
>>>> > Cheers,
>>>> > Till
>>>> >
>>>> > On Wed, Nov 18, 2015 at 5:55 PM, Gwenhael Pasquiers
>>>> > <gwenhael.pasqui...@ericsson.com> wrote:
>>>> >
>>>> > Hello,
>>>> >
>>>> > We're trying to set up high availability using an existing ZooKeeper
>>>> > quorum already running in our Cloudera cluster.
>>>> >
>>>> > So, as per the doc, we've changed the max attempts in YARN's config
>>>> > as well as the flink.yaml:
>>>> >
>>>> > recovery.mode: zookeeper
>>>> > recovery.zookeeper.quorum: host1:3181,host2:3181,host3:3181
>>>> > state.backend: filesystem
>>>> > state.backend.fs.checkpointdir: hdfs:///flink/checkpoints
>>>> > recovery.zookeeper.storageDir: hdfs:///flink/recovery/
>>>> > yarn.application-attempts: 1000
>>>> >
>>>> > Everything is OK as long as recovery.mode is commented out.
>>>> >
>>>> > As soon as I uncomment recovery.mode, the deployment on YARN is
>>>> > stuck on:
>>>> >
>>>> > "Deploying cluster, current state ACCEPTED"
>>>> > "Deployment took more than 60 seconds…."
>>>> >
>>>> > every second.
>>>> >
>>>> > And I have more than enough resources available on my YARN cluster.
>>>> >
>>>> > Do you have any idea of what could cause this, and/or what logs I
>>>> > should look at in order to understand?
>>>> >
>>>> > B.R.
>>>> >
>>>> > Gwenhaël PASQUIERS
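To make the multi-cluster point above concrete, here is a minimal sketch of two flink.yaml files sharing the same ZooKeeper quorum; per Till's explanation, only recovery.zookeeper.path.root has to differ between the two clusters (the /flink-cluster-1 and /flink-cluster-2 values are purely illustrative):

    # flink.yaml of the first HA cluster
    recovery.mode: zookeeper
    recovery.zookeeper.quorum: host1:3181,host2:3181,host3:3181
    recovery.zookeeper.storageDir: hdfs:///flink/recovery/
    recovery.zookeeper.path.root: /flink-cluster-1   # illustrative name

    # flink.yaml of the second HA cluster: same quorum, distinct root path
    recovery.mode: zookeeper
    recovery.zookeeper.quorum: host1:3181,host2:3181,host3:3181
    recovery.zookeeper.storageDir: hdfs:///flink/recovery/
    recovery.zookeeper.path.root: /flink-cluster-2   # illustrative name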