Hi!
As a bit of background: ZooKeeper only lets you store very small amounts of
data. We therefore persist only the changing checkpoint metadata in ZooKeeper.
To recover a job, some constant data is also needed: the JobGraph and the
JAR files. These cannot go to ZooKeeper, but need to go to reliable storage.
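For reference, the relevant flink-conf.yaml entries would look roughly like
the following (key names as in the 0.10.x HA docs, as far as I recall; the
quorum hosts and the HDFS path are placeholders):

recovery.mode: zookeeper
recovery.zookeeper.quorum: master-1:2181,master-2:2181,master-3:2181
recovery.zookeeper.storageDir: hdfs:///flink/recovery

The storageDir is where the JobGraph and JAR files end up, so it has to be a
location that every JobManager can read.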
OK, simply bringing up HDFS on the cluster and using it as the state backend
fixed the issue. Thank you both for the help!
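Concretely, that amounts to pointing the state backend at HDFS instead of the
local file system, roughly like this (key names per the 0.10.x docs; the
namenode host and path are placeholders):

state.backend: filesystem
state.backend.fs.checkpointdir: hdfs://namenode:8020/flink/checkpoints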
On Mon, Feb 15, 2016 at 5:45 PM, Stefano Baghino <
stefano.bagh...@radicalbit.io> wrote:
> You can find the log of the recovering job manager here:
> https://gist.github.com/stefanobaghino/ae28f00efb6bdd907b42
You can find the log of the recovering job manager here:
https://gist.github.com/stefanobaghino/ae28f00efb6bdd907b42
Basically, what Ufuk said happened: the job manager tried to fill in for
the lost one but couldn't find the actual data, because it looked it up
locally whereas, due to my configuration, it had only been written to the
local file system of the job manager that went down.
Hi Stefano,
A correction from my side: you don't need to set the execution retries
for HA, because a new JobManager will automatically take over and
resubmit all jobs that were recovered from the storage directory you
set up. The number of execution retries applies only to jobs that are
restarted after a failure during execution.
> On 15 Feb 2016, at 13:40, Stefano Baghino wrote:
>
> Hi Ufuk, thanks for replying.
>
> Regarding the masters file: yes, I've specified all the masters and checked
> that they were actually running after start-cluster.sh. I'll gladly share
> the logs as soon as I get to see them.
Hi Stefano,
That is true, the documentation doesn't mention that; we will update it.
I just wanted to point you to the documentation in case anything else
needs to be configured.
Instead of setting the number of execution retries on the
StreamExecutionEnvironment, you may also set
"execution-retries.def
Hi Maximilian,
thank you for the reply. I checked the documentation before running my tests
(I'm not expert enough to skip the docs ;)) but it doesn't mention any
specific requirement regarding the execution retries. I'll check it out,
thanks!
On Mon, Feb 15, 2016 at 12:51 PM, Maximilian wrote:
Hi Ufuk, thanks for replying.
Regarding the masters file: yes, I've specified all the masters and checked
that they were actually running after start-cluster.sh. I'll gladly share
the logs as soon as I get to see them.
Regarding the state backend: how does having a non-distributed storage
affect the recovery?
Hi Stefano,
The job should stop temporarily but then be resumed by the new
JobManager. Have you increased the number of execution retries? AFAIK,
it is set to 0 by default, in which case the job will not be re-run,
even in HA mode. You can enable it on the StreamExecutionEnvironment.
Otherwise, you have probably found a bug.
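Just to make that concrete, a minimal sketch of enabling the retries on the
StreamExecutionEnvironment (class and job names are placeholders; the actual
topology is elided):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RetriesSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 0 by default: the job would not be re-run after a failure, even in HA mode
        env.setNumberOfExecutionRetries(3);
        // ... define sources, transformations and sinks here ...
        env.execute("ha-retries-sketch");
    }
}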
Using the local file system as state backend only works if all job
managers run on the same machine. Is that the case?
Have you specified all job managers in the masters file? With the
local file system state backend only something like
host-X
host-X
host-X
will be a valid masters configuration.
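For contrast, with a state backend that all machines can reach (e.g. a path
on HDFS), the usual configuration lists distinct hosts (names here are
placeholders):

host-1
host-2
host-3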
Hello everyone,
last week I ran some tests with Apache ZooKeeper to get a grip on Flink's
HA features. My tests have gone badly so far and I can't sort out the reason.
My latest tests involved Flink 0.10.2, run as a standalone cluster with 3
masters and 4 slaves. The 3 masters also host the ZooKeeper ensemble (3 nodes).
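For completeness, the conf/masters and conf/slaves files for such a setup
would look roughly like this (host names are placeholders for my machines):

conf/masters:
master-1
master-2
master-3

conf/slaves:
worker-1
worker-2
worker-3
worker-4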