Hello everyone,

Last week I ran some tests with Apache ZooKeeper to get a grip on Flink's
HA features. So far the tests have gone badly and I can't work out why.

My latest tests involved Flink 0.10.2, run as a standalone cluster with 3
masters and 4 slaves. The 3 masters also form the ZooKeeper (3.4.6)
ensemble. I started ZooKeeper on each machine, tested its availability,
and then started the Flink cluster. Since there's no reliable distributed
filesystem on the cluster, I had to use the local file system as the state
backend.
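
For reference, a minimal sketch of the HA-related entries in
flink-conf.yaml for a setup like this one. The keys are the ones described
in the Flink 0.10 high-availability documentation; the host names
(master1..master3) and the file:// paths are hypothetical placeholders,
not the values from the actual cluster.

```yaml
# Enable ZooKeeper-based job manager recovery (the default mode is standalone)
recovery.mode: zookeeper

# ZooKeeper ensemble; host names are placeholders for the 3 masters
recovery.zookeeper.quorum: master1:2181,master2:2181,master3:2181

# File-system state backend; the paths below are placeholders
# pointing at the local file system
state.backend: filesystem
state.backend.fs.checkpointdir: file:///tmp/flink/checkpoints

# Directory where job manager recovery metadata is stored
recovery.zookeeper.storageDir: file:///tmp/flink/recovery
```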

I then submitted a very simple streaming job that writes the current
timestamp to a text file on the local file system once per second. Next, I
killed the process running the job manager to verify that another job
manager takes over. However, the job just stopped. I still have to perform
some checks on the handover to the new job manager, but before digging
deeper I wanted to ask whether my expectation that the job keeps running
despite the job manager failure is unreasonable.

Thanks in advance.

-- 
BR,
Stefano Baghino

Software Engineer @ Radicalbit
