I think we should find a way to randomize the paths where the HA data is stored. If users don’t realize that multiple clusters store their data under the same paths, this could lead to problems.
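Until then, users running several HA clusters against the same ZooKeeper quorum and HDFS could at least set distinct values themselves. As a rough sketch (the “cluster-2” path segment below is only a placeholder, not a recommendation), the second cluster’s Flink configuration could look like:

recovery.zookeeper.path.root: /flink/cluster-2
recovery.zookeeper.storageDir: hdfs:///flink/cluster-2/recovery/
state.backend.fs.checkpointdir: hdfs:///flink/cluster-2/checkpoints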
> On 19 Nov 2015, at 08:50, Till Rohrmann <trohrm...@apache.org> wrote:
>
> Hi Gwenhaël,
>
> good to hear that you could resolve the problem.
>
> When you run multiple HA Flink jobs in the same cluster, then you don’t have
> to adjust the configuration of Flink. It should work out of the box.
>
> However, if you run multiple HA Flink clusters, then you have to set a
> distinct ZooKeeper root path for each cluster via the option
> recovery.zookeeper.path.root in the Flink configuration. This is necessary
> because otherwise all JobManagers (the ones of the different clusters) will
> compete for a single leadership. Furthermore, all TaskManagers will only see
> the one and only leader and connect to it. The reason is that the
> TaskManagers will look up their leader at a ZNode below the ZooKeeper root
> path.
>
> If you have other questions then don’t hesitate to ask me.
>
> Cheers,
> Till
>
>
> On Wed, Nov 18, 2015 at 6:37 PM, Gwenhael Pasquiers
> <gwenhael.pasqui...@ericsson.com> wrote:
>
> Nevermind,
>
> Looking at the logs I saw that it was having issues trying to connect to ZK.
> To make it short, it had the wrong port.
>
> It is now starting.
>
> Tomorrow I’ll try to kill some JobManagers *evil*.
>
> Another question: if I have multiple HA Flink jobs, are there some points to
> check in order to be sure that they won’t collide on HDFS or ZK?
>
> B.R.
>
> Gwenhaël PASQUIERS
>
>
> From: Till Rohrmann [mailto:till.rohrm...@gmail.com]
> Sent: Wednesday, 18 November 2015 18:01
> To: user@flink.apache.org
> Subject: Re: YARN High Availability
>
> Hi Gwenhaël,
>
> do you have access to the yarn logs?
>
> Cheers,
> Till
>
>
> On Wed, Nov 18, 2015 at 5:55 PM, Gwenhael Pasquiers
> <gwenhael.pasqui...@ericsson.com> wrote:
>
> Hello,
>
> We’re trying to set up high availability using an existing ZooKeeper quorum
> already running in our Cloudera cluster.
>
> So, as per the doc, we’ve changed the max attempts in YARN’s config as well as
> the flink.yaml:
>
> recovery.mode: zookeeper
> recovery.zookeeper.quorum: host1:3181,host2:3181,host3:3181
> state.backend: filesystem
> state.backend.fs.checkpointdir: hdfs:///flink/checkpoints
> recovery.zookeeper.storageDir: hdfs:///flink/recovery/
> yarn.application-attempts: 1000
>
> Everything is OK as long as recovery.mode is commented out.
> As soon as I uncomment recovery.mode, the deployment on YARN is stuck on:
>
> “Deploying cluster, current state ACCEPTED”
> “Deployment took more than 60 seconds…”
>
> every second.
>
> And I have more than enough resources available on my YARN cluster.
>
> Do you have any idea of what could cause this, and/or what logs I should look
> at in order to understand?
>
> B.R.
>
> Gwenhaël PASQUIERS
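A side note on the log question further down in the thread: assuming log aggregation is enabled on the YARN cluster, the aggregated logs of the JobManager and TaskManager containers can usually be fetched with something like the following (the application id is only a placeholder):

yarn logs -applicationId <application id>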