Using the local file system as the state backend only works if all job managers run on the same machine. Is that the case?
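As a rough illustration of that condition (a sketch, not Flink code), the check below reads the text of a masters file and reports whether every entry names the same host. It assumes each non-empty line is either `host` or `host:port`; the function names `masters_hosts` and `local_fs_backend_ok` are made up for this example.

```python
def masters_hosts(masters_text):
    """Return the host part of every non-empty line of a masters file."""
    hosts = []
    for line in masters_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumed entry format: "host" or "host:port"; keep only the host.
        hosts.append(line.split(":", 1)[0])
    return hosts


def local_fs_backend_ok(masters_text):
    """True only if at least one master is listed and all share one host."""
    hosts = masters_hosts(masters_text)
    return len(hosts) > 0 and len(set(hosts)) == 1
```

With three different machine names in the file the check returns False, which matches a setup where each job manager can only see its own local disk.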
Have you specified all job managers in the masters file? With the local file system state backend, only something like

host-X
host-X
host-X

will be a valid masters configuration.

Can you please share the job manager logs of all started job managers?

– Ufuk

On Mon, Feb 15, 2016 at 12:35 PM, Stefano Baghino <stefano.bagh...@radicalbit.io> wrote:
> Hello everyone,
>
> last week I ran some tests with Apache ZooKeeper to get a grip on Flink
> HA features. My tests have gone badly so far and I can't work out the reason.
>
> My latest tests involved Flink 0.10.2, run as a standalone cluster with 3
> masters and 4 slaves. The 3 masters are also the ZooKeeper (3.4.6) ensemble.
> I started ZooKeeper on each machine, tested its availability and then
> started the Flink cluster. Since there's no reliable distributed filesystem
> on the cluster, I had to use the local file system as the state backend.
>
> I then submitted a very simple streaming job that writes the timestamp to a
> text file on the local file system each second, and then went on to kill the
> process running the job manager to verify that another job manager takes
> over. However, the job just stopped. I still have to perform some checks on
> the handover to the new job manager, but before digging deeper I wanted to
> ask if my expectation of having the job keep going despite the job manager
> failure is unreasonable.
>
> Thanks in advance.
>
> --
> BR,
> Stefano Baghino
>
> Software Engineer @ Radicalbit