Hi Stefano,

The job should stop temporarily and then be resumed by the new JobManager. Have you increased the number of execution retries? AFAIK it defaults to 0, in which case the job is not re-run after a failure, even in HA mode. You can enable retries on the StreamExecutionEnvironment.
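For example, something along these lines should do it (a minimal, untested sketch against the 0.10 streaming API; the class name, retry count, and the dummy pipeline are just placeholders):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RetryExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Allow the job to be restarted up to 3 times after a failure,
            // e.g. when a JobManager or TaskManager goes away. The default
            // of 0 means the job is simply cancelled on the first failure.
            env.setNumberOfExecutionRetries(3);

            // ... your actual streaming topology goes here ...
            env.fromElements(1, 2, 3).print();

            env.execute("retry-example");
        }
    }

With retries enabled, the job should be picked up again after the new JobManager takes over instead of just being cancelled.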
Otherwise, you have probably already found the documentation:

https://ci.apache.org/projects/flink/flink-docs-master/setup/jobmanager_high_availability.html#configuration

(I have also put a rough sketch of the relevant flink-conf.yaml entries below the quoted message.)

Cheers,
Max

On Mon, Feb 15, 2016 at 12:35 PM, Stefano Baghino <stefano.bagh...@radicalbit.io> wrote:
> Hello everyone,
>
> Last week I ran some tests with Apache ZooKeeper to get a grip on Flink's
> HA features. My tests have failed so far and I can't sort out the reason.
>
> My latest tests involved Flink 0.10.2, run as a standalone cluster with 3
> masters and 4 slaves. The 3 masters also form the ZooKeeper (3.4.6) ensemble.
> I started ZooKeeper on each machine, tested its availability, and then
> started the Flink cluster. Since there is no reliable distributed filesystem
> on the cluster, I had to use the local file system as the state backend.
>
> I then submitted a very simple streaming job that writes the timestamp to a
> text file on the local file system every second, and then killed the
> process running the JobManager to verify that another JobManager takes
> over. However, the job just stopped. I still have to perform some checks on
> the handover to the new JobManager, but before digging deeper I wanted to
> ask whether my expectation that the job would keep running despite the
> JobManager failure is unreasonable.
>
> Thanks in advance.
>
> --
> BR,
> Stefano Baghino
>
> Software Engineer @ Radicalbit
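P.S. For reference, the ZooKeeper-related part of flink-conf.yaml should look roughly like this (key names as in the 0.10 docs linked above and host names are placeholders, so please double-check against that page):

    # Use ZooKeeper to elect and recover the leading JobManager
    recovery.mode: zookeeper
    # The ZooKeeper ensemble, i.e. your three master machines
    recovery.zookeeper.quorum: master1:2181,master2:2181,master3:2181

On top of that, the JobManager needs a recovery/storage directory (and the state backend a checkpoint directory) on a filesystem that all masters can reach; if those live on a purely local filesystem, the standby JobManager cannot read the metadata the failed leader wrote and therefore cannot resume the job. The exact key names are listed on the linked page.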