Hello everyone,

last week I ran some tests with Apache ZooKeeper to get a grip on Flink's HA features. So far my tests have gone badly and I can't figure out why.
My latest tests involved Flink 0.10.2, run as a standalone cluster with 3 masters and 4 slaves. The 3 masters also form the ZooKeeper (3.4.6) ensemble. I started ZooKeeper on each machine, tested its availability and then started the Flink cluster. Since there's no reliable distributed filesystem on the cluster, I had to use the local file system as the state backend.

I then submitted a very simple streaming job that writes the current timestamp to a text file on the local file system every second, and then killed the process running the job manager to verify that another job manager takes over. However, the job just stopped.

I still have to perform some checks on the handover to the new job manager, but before digging deeper I wanted to ask whether my expectation that the job keeps running despite the job manager failure is unreasonable.

Thanks in advance.

--
BR,
Stefano Baghino
Software Engineer @ Radicalbit