I think we should find a way to randomize the paths where the HA services store 
their data. If users don’t realize that multiple clusters store data under the same 
paths, this could lead to problems.
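
For illustration only (the per-cluster path suffixes below are hypothetical; Flink does 
not generate them automatically today), namespacing the paths per cluster would avoid 
the collision:

  recovery.zookeeper.path.root: /flink/cluster-one
  recovery.zookeeper.storageDir: hdfs:///flink/recovery/cluster-one/

Randomizing these paths, or deriving them from something unique per cluster, would give 
the same effect without requiring users to think about it.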

> On 19 Nov 2015, at 08:50, Till Rohrmann <trohrm...@apache.org> wrote:
> 
> Hi Gwenhaël,
> 
> good to hear that you could resolve the problem.
> 
> When you run multiple HA Flink jobs in the same cluster, you don’t have to 
> adjust Flink’s configuration. It should work out of the box.
> 
> However, if you run multiple HA Flink clusters, then you have to set a distinct 
> ZooKeeper root path for each cluster via the option 
> recovery.zookeeper.path.root in the Flink configuration, as sketched below. This is 
> necessary because otherwise all JobManagers (the ones of the different clusters) will 
> compete for a single leadership. Furthermore, all TaskManagers will only see 
> that one leader and connect to it, because the TaskManagers look up their leader at a 
> ZNode below the ZooKeeper root path.
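> 
> For example (a minimal sketch; the root path values are just illustrative 
> placeholders), the first cluster could set
> 
> recovery.zookeeper.path.root: /flink/cluster-one
> 
> and the second one
> 
> recovery.zookeeper.path.root: /flink/cluster-two
> 
> so that the JobManagers and TaskManagers of each cluster meet under their own ZNode.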
> 
> If you have other questions, don’t hesitate to ask me.
> 
> Cheers,
> Till
> 
> 
> On Wed, Nov 18, 2015 at 6:37 PM, Gwenhael Pasquiers 
> <gwenhael.pasqui...@ericsson.com> wrote:
> Nevermind,
> 
>  
> 
> Looking at the logs I saw that it was having issues trying to connect to ZK.
> 
> To make it short, it had the wrong port.
> 
>  
> 
> It is now starting.
> 
>  
> 
> Tomorrow I’ll try to kill some JobManagers *evil*.
> 
>  
> 
> Another question: if I have multiple HA Flink jobs, are there some points to 
> check in order to be sure that they won’t collide on HDFS or ZooKeeper?
> 
>  
> 
> B.R.
> 
>  
> 
> Gwenhaël PASQUIERS
> 
>  
> 
> From: Till Rohrmann [mailto:till.rohrm...@gmail.com] 
> Sent: Wednesday, 18 November 2015 18:01
> To: user@flink.apache.org
> Subject: Re: YARN High Availability
> 
>  
> 
> Hi Gwenhaël,
> 
>  
> 
> do you have access to the yarn logs?
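> 
> (Assuming log aggregation is enabled on the cluster, they can usually be fetched 
> after the application has finished with the YARN CLI, e.g.
> 
> yarn logs -applicationId <application id>
> 
> where <application id> is whatever id YARN assigned to the Flink session.)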
> 
>  
> 
> Cheers,
> 
> Till
> 
>  
> 
> On Wed, Nov 18, 2015 at 5:55 PM, Gwenhael Pasquiers 
> <gwenhael.pasqui...@ericsson.com> wrote:
> 
> Hello,
> 
>  
> 
> We’re trying to set up high availability using an existing ZooKeeper quorum 
> already running in our Cloudera cluster.
> 
>  
> 
> So, as per the doc, we’ve changed the maximum number of application attempts in 
> YARN’s configuration (see below) as well as in flink.yaml.
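> 
> For reference, the YARN-side setting meant here is the application master 
> max-attempts in yarn-site.xml, i.e. something like the following (the value just 
> mirrors the Flink setting and is illustrative):
> 
> <property>
>   <name>yarn.resourcemanager.am.max-attempts</name>
>   <value>1000</value>
> </property>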
> 
>  
> 
> recovery.mode: zookeeper
> 
> recovery.zookeeper.quorum: host1:3181,host2:3181,host3:3181
> 
> state.backend: filesystem
> 
> state.backend.fs.checkpointdir: hdfs:///flink/checkpoints
> 
> recovery.zookeeper.storageDir: hdfs:///flink/recovery/
> 
> yarn.application-attempts: 1000
> 
>  
> 
> Everything is OK as long as recovery.mode is commented out.
> 
> As soon as I uncomment recovery.mode, the deployment on YARN is stuck on:
> 
>  
> 
> “Deploying cluster, current state ACCEPTED”.
> 
> “Deployment took more than 60 seconds….”
> 
> Every second.
> 
>  
> 
> And I have more than enough resources available on my YARN cluster.
> 
>  
> 
> Do you have any idea of what could cause this, and/or what logs I should look 
> at in order to understand?
> 
>  
> 
> B.R.
> 
>  
> 
> Gwenhaël PASQUIERS
> 
>  
> 
> 
