Hi Stefano,

The job should stop temporarily but then be resumed by the new
JobManager. Have you increased the number of execution retries? AFAIK,
it is set to 0 by default, which means the job will not be re-run after
a failure, even in HA mode. You can enable retries on the
StreamExecutionEnvironment.
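For example, something like this in your job (a minimal sketch; the
method name is the one I remember from the 0.10.x streaming API, so
double-check it against your version):

    // Sketch: allow the job to be restarted after a failure so that a
    // standby JobManager can resume it. Assumes the 0.10.x streaming API.
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RetriesExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // The default is 0, i.e. the job is never re-run after a failure,
            // even when a new JobManager takes over in HA mode.
            env.setNumberOfExecutionRetries(3);

            // ... define sources/sinks and the rest of the topology here ...

            env.execute("my-job");
        }
    }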

Otherwise, you have probably already found the documentation:
https://ci.apache.org/projects/flink/flink-docs-master/setup/jobmanager_high_availability.html#configuration
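For reference, the HA-related flink-conf.yaml entries look roughly like
this (key names as I recall them from the 0.10.x docs; hostnames and
paths are placeholders, and the linked page has the authoritative list):

    recovery.mode: zookeeper
    recovery.zookeeper.quorum: master1:2181,master2:2181,master3:2181
    recovery.zookeeper.path.root: /flink
    # The storage directory must be reachable by all masters, which is
    # why a purely local path is problematic for recovery.
    recovery.zookeeper.storageDir: hdfs:///flink/recovery/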

Cheers,
Max

On Mon, Feb 15, 2016 at 12:35 PM, Stefano Baghino
<stefano.bagh...@radicalbit.io> wrote:
> Hello everyone,
>
> Last week I ran some tests with Apache ZooKeeper to get a grip on Flink's
> HA features. My tests have gone badly so far and I can't sort out the reason.
>
> My latest tests involved Flink 0.10.2, run as a standalone cluster with 3
> masters and 4 slaves. The 3 masters also form the ZooKeeper (3.4.6) ensemble.
> I started ZooKeeper on each machine, tested its availability, and then
> started the Flink cluster. Since there's no reliable distributed filesystem
> on the cluster, I had to use the local file system as the state backend.
>
> I then submitted a very simple streaming job that writes the timestamp to a
> text file on the local file system every second, and then killed the
> process running the job manager to verify that another job manager takes
> over. However, the job just stopped. I still have to perform some checks on
> the handover to the new job manager, but before digging deeper I wanted to
> ask whether my expectation of having the job keep running despite the job
> manager failure is unreasonable.
>
> Thanks in advance.
>
> --
> BR,
> Stefano Baghino
>
> Software Engineer @ Radicalbit
