Re: Issues testing Flink HA w/ ZooKeeper

Stefano Baghino Tue, 16 Feb 2016 02:31:45 -0800

Ok, simply setting up HDFS on the cluster and using it as the state backend
fixed the issue. Thank you both for the help!
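
For reference, a minimal sketch of what such a setup might look like in
flink-conf.yaml. The key names are assumed from the Flink 0.10-era HA
documentation and should be checked against the docs for the version in use.
The host names and HDFS paths are placeholders, not values taken from this
thread.

```yaml
# Assumed Flink 0.10.x key names; verify them for your release.
# Enable JobManager HA through ZooKeeper.
recovery.mode: zookeeper
# Placeholder host names for the three-node ZooKeeper ensemble.
recovery.zookeeper.quorum: master1:2181,master2:2181,master3:2181
# JobManager recovery metadata must live on storage that every master can
# reach, otherwise a standby JobManager cannot find it after a failover.
recovery.zookeeper.storageDir: hdfs:///flink/recovery
# Keep checkpoints on the same shared file system (placeholder path).
state.backend: filesystem
state.backend.fs.checkpointdir: hdfs:///flink/checkpoints
```

The point of the sketch is that both directories use an hdfs:// URI rather
than a local path, so whichever JobManager takes over can read the data.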


On Mon, Feb 15, 2016 at 5:45 PM, Stefano Baghino <
[email protected]> wrote:

> You can find the log of the recovering job manager here:
> https://gist.github.com/stefanobaghino/ae28f00efb6bdd907b42
>
> Basically, what Ufuk said happened: the job manager tried to fill in for
> the lost one but couldn't find the actual data because it looked it up
> locally whereas due to my configuration it was actually stored on another
> machine.
>
> Thanks for the help, it's been invaluable!
>
> On Mon, Feb 15, 2016 at 5:24 PM, Maximilian Michels <[email protected]>
> wrote:
>
>> Hi Stefano,
>>
>> A correction from my side: You don't need to set the execution retries
>> for HA because a new JobManager will automatically take over and
>> resubmit all jobs which were recovered from the storage directory you
>> set up. The number of execution retries applies only to jobs which are
>> restarted due to a TaskManager failure.
>>
>> It would be great if you could supply some logs.
>>
>> Cheers,
>> Max
>>
>>
>> On Mon, Feb 15, 2016 at 1:45 PM, Maximilian Michels <[email protected]>
>> wrote:
>> > Hi Stefano,
>> >
>> > That is true. The documentation doesn't mention that. Just wanted to
>> > point you to the documentation if anything else needs to be
>> > configured. We will update it.
>> >
>> > Instead of setting the number of execution retries on the
>> > StreamExecutionEnvironment, you may also set
>> > "execution-retries.default" in the flink-conf.yaml. Let us know if
>> > that fixes your setup.
>> >
>> > Cheers,
>> > Max
>> >
>> > On Mon, Feb 15, 2016 at 1:41 PM, Stefano Baghino
>> > <[email protected]> wrote:
>> >> Hi Maximilian,
>> >>
>> >> thank you for the reply. I checked the documentation before running
>> >> my tests (I'm not expert enough to not read the docs ;)) but it doesn't
>> >> mention any specific requirement regarding the execution retries. I'll
>> >> check it out, thanks!
>> >>
>> >> On Mon, Feb 15, 2016 at 12:51 PM, Maximilian Michels <[email protected]>
>> >> wrote:
>> >>>
>> >>> Hi Stefano,
>> >>>
>> >>> The Job should stop temporarily but then be resumed by the new
>> >>> JobManager. Have you increased the number of execution retries? AFAIK,
>> >>> it is set to 0 by default. This will not re-run the job, even in HA
>> >>> mode. You can enable it on the StreamExecutionEnvironment.
>> >>>
>> >>> Otherwise, you have probably already found the documentation:
>> >>>
>> >>>
>> >>> https://ci.apache.org/projects/flink/flink-docs-master/setup/jobmanager_high_availability.html#configuration
>> >>>
>> >>> Cheers,
>> >>> Max
>> >>>
>> >>> On Mon, Feb 15, 2016 at 12:35 PM, Stefano Baghino
>> >>> <[email protected]> wrote:
>> >>> > Hello everyone,
>> >>> >
>> >>> > last week I ran some tests with Apache ZooKeeper to get a grip on
>> >>> > Flink HA features. My tests have gone badly so far and I can't sort
>> >>> > out the reason.
>> >>> >
>> >>> > My latest tests involved Flink 0.10.2, run as a standalone cluster
>> >>> > with 3 masters and 4 slaves. The 3 masters are also the ZooKeeper
>> >>> > (3.4.6) ensemble. I started ZooKeeper on each machine, tested its
>> >>> > availability and then started the Flink cluster. Since there's no
>> >>> > reliable distributed filesystem on the cluster, I had to use the
>> >>> > local file system as the state backend.
>> >>> >
>> >>> > I then submitted a very simple streaming job that writes the
>> >>> > timestamp to a text file on the local file system each second and
>> >>> > then went on to kill the process running the job manager to verify
>> >>> > that another job manager takes over. However, the job just stopped.
>> >>> > I still have to perform some checks on the handover to the new job
>> >>> > manager, but before digging deeper I wanted to ask if my expectation
>> >>> > of having the job keep running despite the job manager failure is
>> >>> > unreasonable.
>> >>> >
>> >>> > Thanks in advance.
>> >>> >
>> >>> > --
>> >>> > BR,
>> >>> > Stefano Baghino
>> >>> >
>> >>> > Software Engineer @ Radicalbit



-- 
BR,
Stefano Baghino

Software Engineer @ Radicalbit
