Ok, simply turning up HDFS on the cluster and using it as the state backend
fixed the issue. Thank you both for the help!
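
For anyone who finds this thread later, here is a rough sketch of the kind of
flink-conf.yaml entries involved. The key names are an assumption based on the
Flink 0.10 documentation and were renamed in later releases, and the host names
and HDFS path are placeholders, so check the HA documentation for your version:

```yaml
# Sketch only: hypothetical hosts and path; key names assumed from the Flink 0.10 docs
recovery.mode: zookeeper
recovery.zookeeper.quorum: master1:2181,master2:2181,master3:2181
state.backend: FILESYSTEM
# Must be reachable from every JobManager, hence HDFS rather than a local path
state.backend.fs.dir.recovery: hdfs:///flink/recovery
```

The important point is that the recovery directory lives on storage every
master can read, so a standby job manager can find the data after a failover.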

On Mon, Feb 15, 2016 at 5:45 PM, Stefano Baghino <stefano.bagh...@radicalbit.io> wrote:

> You can find the log of the recovering job manager here:
> https://gist.github.com/stefanobaghino/ae28f00efb6bdd907b42
>
> Basically, what Ufuk said happened: the job manager tried to fill in for
> the lost one but couldn't find the actual data because it looked it up
> locally whereas due to my configuration it was actually stored on another
> machine.
>
> Thanks for the help, it's been invaluable!
>
> On Mon, Feb 15, 2016 at 5:24 PM, Maximilian Michels <m...@apache.org> wrote:
>
>> Hi Stefano,
>>
>> A correction from my side: You don't need to set the execution retries
>> for HA because a new JobManager will automatically take over and
>> resubmit all jobs which were recovered from the storage directory you
>> set up. The number of execution retries applies only to jobs which are
>> restarted due to a TaskManager failure.
>>
>> It would be great if you could supply some logs.
>>
>> Cheers,
>> Max
>>
>>
>> On Mon, Feb 15, 2016 at 1:45 PM, Maximilian Michels <m...@apache.org> wrote:
>> > Hi Stefano,
>> >
>> > That is true; the documentation doesn't mention that. I just wanted to
>> > point you to the documentation in case anything else needs to be
>> > configured. We will update it.
>> >
>> > Instead of setting the number of execution retries on the
>> > StreamExecutionEnvironment, you may also set
>> > "execution-retries.default" in the flink-conf.yaml. Let us know if
>> > that fixes your setup.
>> >
>> > Cheers,
>> > Max
>> >
>> > On Mon, Feb 15, 2016 at 1:41 PM, Stefano Baghino
>> > <stefano.bagh...@radicalbit.io> wrote:
>> >> Hi Maximilian,
>> >>
>> >> thank you for the reply. I checked the documentation before running
>> >> my tests (I'm not expert enough to skip the docs ;)), but it doesn't
>> >> mention any specific requirement regarding execution retries. I'll
>> >> look into it, thanks!
>> >>
>> >> On Mon, Feb 15, 2016 at 12:51 PM, Maximilian Michels <m...@apache.org> wrote:
>> >>>
>> >>> Hi Stefano,
>> >>>
>> >>> The job should stop temporarily but then be resumed by the new
>> >>> JobManager. Have you increased the number of execution retries? AFAIK,
>> >>> it is set to 0 by default, in which case the job is not re-run, even
>> >>> in HA mode. You can set it on the StreamExecutionEnvironment.
>> >>>
>> >>> Otherwise, you have probably already found the documentation:
>> >>>
>> >>>
>> >>> https://ci.apache.org/projects/flink/flink-docs-master/setup/jobmanager_high_availability.html#configuration
>> >>>
>> >>> Cheers,
>> >>> Max
>> >>>
>> >>> On Mon, Feb 15, 2016 at 12:35 PM, Stefano Baghino
>> >>> <stefano.bagh...@radicalbit.io> wrote:
>> >>> > Hello everyone,
>> >>> >
>> >>> > last week I ran some tests with Apache ZooKeeper to get a grip on
>> >>> > Flink's HA features. My tests have gone badly so far and I can't work
>> >>> > out why.
>> >>> >
>> >>> > My latest tests involved Flink 0.10.2, run as a standalone cluster with
>> >>> > 3 masters and 4 slaves. The 3 masters are also the ZooKeeper (3.4.6)
>> >>> > ensemble. I started ZooKeeper on each machine, tested its availability,
>> >>> > and then started the Flink cluster. Since there's no reliable
>> >>> > distributed filesystem on the cluster, I had to use the local file
>> >>> > system as the state backend.
>> >>> > I then submitted a very simple streaming job that writes the timestamp
>> >>> > to a text file on the local file system each second, and then killed
>> >>> > the process running the job manager to verify that another job manager
>> >>> > takes over. However, the job just stopped. I still have to perform some
>> >>> > checks on the handover to the new job manager, but before digging
>> >>> > deeper I wanted to ask whether my expectation that the job keeps
>> >>> > running despite the job manager failure is unreasonable.
>> >>> >
>> >>> > Thanks in advance.
>> >>> >
>> >>> > --
>> >>> > BR,
>> >>> > Stefano Baghino
>> >>> >
>> >>> > Software Engineer @ Radicalbit
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> BR,
>> >> Stefano Baghino
>> >>
>> >> Software Engineer @ Radicalbit
>>
>
>
>
> --
> BR,
> Stefano Baghino
>
> Software Engineer @ Radicalbit
>



-- 
BR,
Stefano Baghino

Software Engineer @ Radicalbit
