Hi everybody,

I am writing my master's thesis on making Flink iterations / iterative Flink algorithms fault tolerant. The first approach I implemented is basic checkpointing: every N iterations, the current state is saved to HDFS. To do this I enabled data sinks inside of iterations, then attached a new checkpointing sink to the beginning of each iteration. To recover from a previous checkpoint, I cancel all tasks, add a new data source in front of the iteration, and reschedule the tasks with a lower degree of parallelism (DOP). I do this from the JobManager at runtime, without starting a new job.

The problem is that the input data to the iteration sometimes has certain properties, such as a particular partitioning or sort order, and I am struggling to reconstruct these properties from the checkpoint source. I figured an easier way to handle this would be to re-optimize the new plan (with the new source as input to the iteration) before rescheduling. But in the current project structure, flink-runtime has no access to flink-optimizer, and changing that would be a major design break.

Does anybody have any advice on this?
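For concreteness, here is a rough sketch of the checkpoint/recovery scheme I have in mind. This is plain Java, not the actual Flink API — all names are made up for illustration, the "state" is just an integer standing in for the iteration's partial solution, and the HDFS write is replaced by an in-memory map:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch (not real Flink code) of checkpoint-every-N:
 * the iteration state is snapshotted every N supersteps; on failure
 * the loop resumes from the latest snapshot instead of the original input.
 */
public class IterationCheckpointSketch {
    static int runWithRecovery(int supersteps, int checkpointInterval, int failAt) {
        Map<Integer, Integer> snapshots = new HashMap<>(); // superstep -> state
        int state = 0;          // stand-in for the iteration's partial solution
        int lastCheckpoint = 0; // superstep of the most recent snapshot
        snapshots.put(0, state);
        boolean failed = false;
        int i = 1;
        while (i <= supersteps) {
            if (!failed && i == failAt) {
                // simulated task failure: roll back to the last snapshot
                failed = true;
                state = snapshots.get(lastCheckpoint);
                i = lastCheckpoint + 1;
                continue;
            }
            state += i; // stand-in for one superstep's work
            if (i % checkpointInterval == 0) {
                snapshots.put(i, state); // would be an HDFS write in practice
                lastCheckpoint = i;
            }
            i++;
        }
        return state;
    }

    public static void main(String[] args) {
        // With or without a simulated failure, the final state is the same
        // (here: the sum 1..10 = 55).
        System.out.println(runWithRecovery(10, 3, 8)); // 55
        System.out.println(runWithRecovery(10, 3, 0)); // 55, no failure triggered
    }
}
```

Note that what this toy version does not capture is exactly my problem: the rolled-back "state" here carries no physical properties, whereas in Flink the checkpointed data loses its partitioning/sorting when it is re-read from the new source.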
best, Markus