Hi!

As a bit of background: ZooKeeper only allows you to store very small amounts of data, so we persist only the changing checkpoint metadata in ZooKeeper.

To recover a job, some constant data is also needed: the JobGraph and the JAR files. These cannot go into ZooKeeper; they need to go to reliable storage such as HDFS, S3, or a mounted file system.
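For illustration, the relevant flink-conf.yaml entries might look like the following (a minimal sketch; key names as of the Flink 0.10.x docs, and the hosts and paths are placeholders for your own setup):

    # Use ZooKeeper for JobManager leader election and the
    # changing checkpoint metadata
    recovery.mode: zookeeper
    recovery.zookeeper.quorum: master1:2181,master2:2181,master3:2181
    # The constant data (JobGraph, JAR files) goes to reliable storage
    recovery.zookeeper.storageDir: hdfs:///flink/recovery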
Greetings,
Stephan

On Tue, Feb 16, 2016 at 11:30 AM, Stefano Baghino <stefano.bagh...@radicalbit.io> wrote:

> Ok, simply turning up HDFS on the cluster and using it as the state
> backend fixed the issue. Thank you both for the help!
>
> On Mon, Feb 15, 2016 at 5:45 PM, Stefano Baghino <stefano.bagh...@radicalbit.io> wrote:
>
>> You can find the log of the recovering job manager here:
>> https://gist.github.com/stefanobaghino/ae28f00efb6bdd907b42
>>
>> Basically, what Ufuk said happened: the job manager tried to fill in
>> for the lost one but couldn't find the actual data, because it looked
>> it up locally, whereas due to my configuration it was actually stored
>> on another machine.
>>
>> Thanks for the help, it's really been precious!
>>
>> On Mon, Feb 15, 2016 at 5:24 PM, Maximilian Michels <m...@apache.org> wrote:
>>
>>> Hi Stefano,
>>>
>>> A correction from my side: you don't need to set the execution retries
>>> for HA, because a new JobManager will automatically take over and
>>> resubmit all jobs that were recovered from the storage directory you
>>> set up. The number of execution retries applies only to jobs that are
>>> restarted due to a TaskManager failure.
>>>
>>> It would be great if you could supply some logs.
>>>
>>> Cheers,
>>> Max
>>>
>>> On Mon, Feb 15, 2016 at 1:45 PM, Maximilian Michels <m...@apache.org> wrote:
>>> > Hi Stefano,
>>> >
>>> > That is true; the documentation doesn't mention that. I just wanted
>>> > to point you to the documentation in case anything else needs to be
>>> > configured. We will update it.
>>> >
>>> > Instead of setting the number of execution retries on the
>>> > StreamExecutionEnvironment, you may also set
>>> > "execution-retries.default" in the flink-conf.yaml. Let us know if
>>> > that fixes your setup.
>>> >
>>> > Cheers,
>>> > Max
>>> >
>>> > On Mon, Feb 15, 2016 at 1:41 PM, Stefano Baghino <stefano.bagh...@radicalbit.io> wrote:
>>> >> Hi Maximilian,
>>> >>
>>> >> thank you for the reply. I checked out the documentation before
>>> >> running my tests (I'm not expert enough to not read the docs ;))
>>> >> but it doesn't mention any specific requirement regarding the
>>> >> execution retries. I'll check it out, thanks!
>>> >>
>>> >> On Mon, Feb 15, 2016 at 12:51 PM, Maximilian Michels <m...@apache.org> wrote:
>>> >>>
>>> >>> Hi Stefano,
>>> >>>
>>> >>> The job should stop temporarily but then be resumed by the new
>>> >>> JobManager. Have you increased the number of execution retries?
>>> >>> AFAIK, it is set to 0 by default, which means the job will not be
>>> >>> re-run, even in HA mode. You can enable it on the
>>> >>> StreamExecutionEnvironment.
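>>> >>>
>>> >>> For example, a minimal sketch against the 0.10.x Java API (the
>>> >>> retry count of 3 here is just an arbitrary illustration):
>>> >>>
>>> >>>     import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>>> >>>
>>> >>>     public class RetriesExample {
>>> >>>         public static void main(String[] args) {
>>> >>>             StreamExecutionEnvironment env =
>>> >>>                     StreamExecutionEnvironment.getExecutionEnvironment();
>>> >>>             // Defaults to 0: the job is not re-run after a
>>> >>>             // failure, even in HA mode.
>>> >>>             env.setNumberOfExecutionRetries(3);
>>> >>>             // ... build the topology and call env.execute() as usual ...
>>> >>>         }
>>> >>>     }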
>>> >>> Otherwise, you have probably already found the documentation:
>>> >>>
>>> >>> https://ci.apache.org/projects/flink/flink-docs-master/setup/jobmanager_high_availability.html#configuration
>>> >>>
>>> >>> Cheers,
>>> >>> Max
>>> >>>
>>> >>> On Mon, Feb 15, 2016 at 12:35 PM, Stefano Baghino <stefano.bagh...@radicalbit.io> wrote:
>>> >>> > Hello everyone,
>>> >>> >
>>> >>> > last week I ran some tests with Apache ZooKeeper to get a grip
>>> >>> > on Flink's HA features. My tests have gone badly so far and I
>>> >>> > can't sort out the reason.
>>> >>> >
>>> >>> > My latest tests involved Flink 0.10.2, run as a standalone
>>> >>> > cluster with 3 masters and 4 slaves. The 3 masters also form the
>>> >>> > ZooKeeper (3.4.6) ensemble. I started ZooKeeper on each machine,
>>> >>> > tested its availability, and then started the Flink cluster.
>>> >>> > Since there's no reliable distributed filesystem on the cluster,
>>> >>> > I had to use the local file system as the state backend.
>>> >>> >
>>> >>> > I then submitted a very simple streaming job that writes the
>>> >>> > timestamp to a text file on the local file system every second,
>>> >>> > and then killed the process running the job manager to verify
>>> >>> > that another job manager takes over. However, the job just
>>> >>> > stopped. I still have to perform some checks on the handover to
>>> >>> > the new job manager, but before digging deeper I wanted to ask
>>> >>> > whether my expectation of having the job keep running despite
>>> >>> > the job manager failure is unreasonable.
>>> >>> >
>>> >>> > Thanks in advance.
>>> >>> >
>>> >>> > --
>>> >>> > BR,
>>> >>> > Stefano Baghino
>>> >>> >
>>> >>> > Software Engineer @ Radicalbit
>>> >>
>>> >>
>>> >> --
>>> >> BR,
>>> >> Stefano Baghino
>>> >>
>>> >> Software Engineer @ Radicalbit
>>
>>
>> --
>> BR,
>> Stefano Baghino
>>
>> Software Engineer @ Radicalbit
>
>
> --
> BR,
> Stefano Baghino
>
> Software Engineer @ Radicalbit
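P.S.: For anyone finding this thread later, the fix Stefano describes at the top (using HDFS as the state backend) amounts to pointing both the checkpoint directory and the recovery storage directory at the distributed filesystem rather than a node-local disk. A sketch of the corresponding flink-conf.yaml entries (key names as of Flink 0.10.x; the paths are placeholders):

    # Keep checkpoint state somewhere every (new) JobManager can reach it
    state.backend: filesystem
    state.backend.fs.checkpointdir: hdfs:///flink/checkpoints
    # Likewise for the recovered JobGraphs and JAR files
    recovery.zookeeper.storageDir: hdfs:///flink/recovery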