Re: Issues testing Flink HA w/ ZooKeeper

2016-02-16 Thread Stephan Ewen
Hi! As a bit of background: ZooKeeper only allows you to store very small amounts of data. We hence persist only the changing checkpoint metadata in ZooKeeper. To recover a job, some constant data is also needed: the JobGraph and the JAR files. These cannot go to ZooKeeper, but need to go to a reliable stor
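As a rough illustration of the split described above (small metadata in ZooKeeper, large constant data elsewhere), here is a minimal Python sketch. It is not Flink's implementation: `SmallMetadataStore`, `persist_job`, `recover_job` and the `/jobs/<id>` path layout are all hypothetical names, and the in-memory dict only stands in for ZooKeeper. The 1 MB cap mirrors ZooKeeper's default znode size limit.

```python
import os

# ZooKeeper's default limit for a single znode is about 1 MB.
MAX_NODE_BYTES = 1024 * 1024


class SmallMetadataStore:
    """Hypothetical in-memory stand-in for ZooKeeper: small values only."""

    def __init__(self):
        self._nodes = {}

    def put(self, path, data):
        if len(data) > MAX_NODE_BYTES:
            raise ValueError("payload too large for the metadata store")
        self._nodes[path] = data

    def get(self, path):
        return self._nodes[path]


def persist_job(store, storage_dir, job_id, job_graph):
    """Write the large blob to storage_dir and keep only a pointer in the store."""
    blob_path = os.path.join(storage_dir, job_id + ".jobgraph")
    with open(blob_path, "wb") as f:
        f.write(job_graph)
    store.put("/jobs/" + job_id, blob_path.encode("utf-8"))
    return blob_path


def recover_job(store, job_id):
    """Follow the pointer from the store and read the blob back.

    This only works if the stored path is reachable from the machine
    doing the recovery, which is why storage_dir has to be shared.
    """
    blob_path = store.get("/jobs/" + job_id).decode("utf-8")
    with open(blob_path, "rb") as f:
        return f.read()
```

If `storage_dir` is a local directory and `recover_job` runs on a different machine, the `open` call fails, which matches the failure reported in this thread.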

Re: Issues testing Flink HA w/ ZooKeeper

2016-02-16 Thread Stefano Baghino
OK, simply bringing up HDFS on the cluster and using it as the state backend fixed the issue. Thank you both for the help! On Mon, Feb 15, 2016 at 5:45 PM, Stefano Baghino < stefano.bagh...@radicalbit.io> wrote: > You can find the log of the recovering job manager here: > https://gist.github.com/s

Re: Issues testing Flink HA w/ ZooKeeper

2016-02-15 Thread Stefano Baghino
You can find the log of the recovering job manager here: https://gist.github.com/stefanobaghino/ae28f00efb6bdd907b42 Basically, what Ufuk said happened: the job manager tried to fill in for the lost one but couldn't find the actual data because it looked it up locally whereas due to my configurati

Re: Issues testing Flink HA w/ ZooKeeper

2016-02-15 Thread Maximilian Michels
Hi Stefano, A correction from my side: You don't need to set the execution retries for HA because a new JobManager will automatically take over and resubmit all jobs which were recovered from the storage directory you set up. The number of execution retries applies only to jobs which are restarted

Re: Issues testing Flink HA w/ ZooKeeper

2016-02-15 Thread Ufuk Celebi
> On 15 Feb 2016, at 13:40, Stefano Baghino > wrote: > > Hi Ufuk, thanks for replying. > > Regarding the masters file: yes, I've specified all the masters and checked > out that they were actually running after the start-cluster.sh. I'll gladly > share the logs as soon as I get to see them.

Re: Issues testing Flink HA w/ ZooKeeper

2016-02-15 Thread Maximilian Michels
Hi Stefano, That is true. The documentation doesn't mention that. Just wanted to point you to the documentation if anything else needs to be configured. We will update it. Instead of setting the number of execution retries on the StreamExecutionEnvironment, you may also set "execution-retries.def

Re: Issues testing Flink HA w/ ZooKeeper

2016-02-15 Thread Stefano Baghino
Hi Maximilian, thank you for the reply. I checked the documentation before running my tests (I'm not expert enough to not read the docs ;)) but it doesn't mention any specific requirement regarding the execution retries. I'll check it out, thanks! On Mon, Feb 15, 2016 at 12:51 PM, Maximili

Re: Issues testing Flink HA w/ ZooKeeper

2016-02-15 Thread Stefano Baghino
Hi Ufuk, thanks for replying. Regarding the masters file: yes, I've specified all the masters and checked that they were actually running after the start-cluster.sh. I'll gladly share the logs as soon as I get to see them. Regarding the state backend: how does having a non-distributed storage

Re: Issues testing Flink HA w/ ZooKeeper

2016-02-15 Thread Maximilian Michels
Hi Stefano, The Job should stop temporarily but then be resumed by the new JobManager. Have you increased the number of execution retries? AFAIK, it is set to 0 by default. This will not re-run the job, even in HA mode. You can enable it on the StreamExecutionEnvironment. Otherwise, you have prob

Re: Issues testing Flink HA w/ ZooKeeper

2016-02-15 Thread Ufuk Celebi
Using the local file system as the state backend only works if all job managers run on the same machine. Is that the case? Have you specified all job managers in the masters file? With the local file system state backend, only something like host-X host-X host-X will be a valid masters configuration.
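The constraint above can be checked mechanically. The following Python sketch is not part of Flink: `parse_masters` and `local_storage_is_safe` are hypothetical helpers, and the sketch assumes one master per line, an optional `:port` suffix, and `#` comment lines.

```python
def parse_masters(text):
    """Return the host of every entry in a masters file, one entry per line.

    Assumptions for this sketch: blank lines and lines starting with '#'
    are skipped, and an optional ':port' suffix is ignored.
    """
    hosts = []
    for line in text.splitlines():
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue
        hosts.append(entry.split(":")[0].strip())
    return hosts


def local_storage_is_safe(masters_text):
    """True only if every listed job manager runs on the same host.

    With a local file system path, a standby job manager on another
    machine cannot read what the failed one wrote.
    """
    return len(set(parse_masters(masters_text))) <= 1
```

For example, `local_storage_is_safe("host-X\nhost-X\nhost-X\n")` is true, while a file listing three different hosts fails the check and calls for a shared file system such as HDFS.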

Issues testing Flink HA w/ ZooKeeper

2016-02-15 Thread Stefano Baghino
Hello everyone, last week I ran some tests with Apache ZooKeeper to get a grip on Flink HA features. My tests have gone badly so far and I can't sort out the reason. My latest tests involved Flink 0.10.2, run as a standalone cluster with 3 masters and 4 slaves. The 3 masters are also the ZooKeeper (3