At the moment, the system can only deal with lost slots (nodes) if either
there are excess slots that have not been used before or the failed
node is restarted. The latter is the case for YARN applications, for
example. There, the application master will restart containers that have
died.
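
The slot condition described above can be illustrated with a toy model. This is only a sketch and not Flink code: the names `Cluster` and `can_restart` are invented for illustration, and the model assumes that a restarted job simply needs as many free slots as it had before the failure.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical cluster model: every live node offers the same number of slots."""
    slots_per_node: int
    live_nodes: int

    def total_slots(self) -> int:
        return self.slots_per_node * self.live_nodes


def can_restart(cluster: Cluster, required_slots: int) -> bool:
    """Assumption: a restart succeeds only if enough slots exist on live nodes."""
    return cluster.total_slots() >= required_slots


# 4 nodes x 2 slots = 8 slots; the job needs 6, so 2 slots are spare.
cluster = Cluster(slots_per_node=2, live_nodes=4)
required = 6

cluster.live_nodes -= 1  # one node dies, 6 slots remain: spare slots cover the loss
after_first_failure = can_restart(cluster, required)

cluster.live_nodes -= 1  # a second node dies, only 4 slots remain
after_second_failure = can_restart(cluster, required)

cluster.live_nodes += 1  # a replacement comes up (e.g. a restarted container)
after_replacement = can_restart(cluster, required)
```

In this model the first failure is absorbed by the spare slots, the second is not, and the job becomes restartable again only once a replacement node provides the missing slots.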
Thank you, Till!
Does the current (in-progress) implementation also consider the problem
of losing the task slots of the failed node(s), something related to
[2]?
[2] https://issues.apache.org/jira/browse/FLINK-3047
Best,
Ovidiu
> On 22 Feb 2016, at 18:13, Till Rohrmann wrote:
>
Hi Ovidiu,
At the moment, Flink's batch fault tolerance restarts the whole job in case
of a failure. However, parts of the logic to do partial backtracking, such
as intermediate result partitions and the backtracking algorithm, are
already implemented or exist as a PR [1]. So we hope to complete the
Hi
If a node fails, what does 'Fault tolerance for programs in
the DataSet API works by retrying failed executions' [1] mean?
- the work already done by the rest of the nodes is not lost; only the work
of the lost node is recomputed, and job execution continues
or
-entire job execution is