On 2021/01/06 13:35, Arvid Heise wrote:
I was actually not thinking about concurrent checkpoints (and actually want
to get rid of them once UC is established, since they are addressing the
same thing).

I would give a huge +1 to that. I don't see why we would need concurrent checkpoints in most cases. (Any case, even?)

However, I have the impression that you think mostly in terms of tasks and
I mostly think in terms of subtasks. I especially want to have proper
support for bounded sources where one partition is much larger than the
other partitions (might be in conjunction with unbounded sources such that
checkpointing is plausible to begin with). Hence, most of the subtasks are
finished with one straggler remaining. In this case, the barriers are
inserted now only in the straggling source subtask and potentially in any
running downstream subtask.
As far as I have understood, this would require barriers to be inserted
downstream leading to similar race conditions.

No, I'm also thinking in terms of subtasks when it comes to triggering. As long as a subtask has at least one running upstream task, we don't need to manually trigger that task. A task knows which of its inputs have finished, so it takes those out of the calculation that waits for barriers from all upstream tasks. In the case where only a single upstream source remains, the barriers from that source will then trigger checkpointing at the downstream task.
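
To make the alignment rule above concrete, here is a minimal sketch in Python. It is only a toy model of the idea, not Flink code: the class `BarrierAligner` and its method names are made up for illustration. A subtask waits for barriers only from inputs that are still running, and an input finishing can itself complete an alignment that was already in progress.

```python
class BarrierAligner:
    """Toy model of one subtask's barrier alignment (hypothetical, not Flink's API)."""

    def __init__(self, input_ids):
        self.active = set(input_ids)  # upstream subtasks that are still running
        self.pending = {}             # checkpoint id -> inputs whose barrier has arrived

    def _try_trigger(self, checkpoint_id):
        # Aligned once every still-running input has delivered its barrier.
        # With no running input left, barriers can no longer arrive, so this
        # sketch never triggers in that case.
        arrived = self.pending.get(checkpoint_id, set())
        if self.active and self.active <= arrived:
            del self.pending[checkpoint_id]
            return True
        return False

    def on_barrier(self, input_id, checkpoint_id):
        """Record a barrier; return True if the checkpoint can now be taken."""
        self.pending.setdefault(checkpoint_id, set()).add(input_id)
        return self._try_trigger(checkpoint_id)

    def on_input_finished(self, input_id):
        """Drop a finished input; return the checkpoints this unblocks."""
        self.active.discard(input_id)
        return [c for c in sorted(self.pending) if self._try_trigger(c)]


aligner = BarrierAligner(["src-1", "src-2", "src-3"])
first = aligner.on_barrier("src-3", 1)            # still waiting on src-1, src-2
after_one = aligner.on_input_finished("src-1")    # still waiting on src-2
after_two = aligner.on_input_finished("src-2")    # only the straggler is left
second = aligner.on_barrier("src-3", 2)           # straggler alone triggers
```

With only `src-3` left running, its barrier alone is enough to trigger the checkpoint downstream, which is the single-straggler case discussed above.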

I'm also concerned about the notion of a final checkpoint. What happens
when this final checkpoint times out (checkpoint timeout > async timeout)
or fails for a different reason? I'm currently more inclined to just let
checkpoints work until the whole graph is completed (and thought this was
the initial goal of the whole FLIP to begin with). However, that would
require subtasks to stay alive until they receive checkpointCompleted
callback (which is currently also not guaranteed)...

The idea is that the final checkpoint is whatever checkpoint succeeds in the end. When a task (and I mostly mean subtask when I say task) knows that it is done it waits for the next successful checkpoint and then shuts down.
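
As a rough illustration of that lifecycle, here is a small Python sketch. The names `State`, `FinishingTask`, `on_input_exhausted` and `on_checkpoint_result` are hypothetical, and the sketch assumes that any checkpoint reported as successful after the task is done counts as its final one. A failed or timed-out checkpoint simply leaves the task waiting for the next attempt.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    WAITING_FOR_CHECKPOINT = "waiting_for_checkpoint"
    SHUT_DOWN = "shut_down"


class FinishingTask:
    """Toy model of a (sub)task that shuts down after the next successful checkpoint."""

    def __init__(self):
        self.state = State.RUNNING
        self.final_checkpoint_id = None

    def on_input_exhausted(self):
        # The task knows it is done, but keeps its resources for now.
        if self.state is State.RUNNING:
            self.state = State.WAITING_FOR_CHECKPOINT

    def on_checkpoint_result(self, checkpoint_id, succeeded):
        # Only a successful checkpoint releases a task that is already done;
        # a failure or timeout means it keeps waiting.
        if self.state is State.WAITING_FOR_CHECKPOINT and succeeded:
            self.final_checkpoint_id = checkpoint_id
            self.state = State.SHUT_DOWN
        return self.state


task = FinishingTask()
s1 = task.on_checkpoint_result(1, succeeded=True)   # still running, nothing changes
task.on_input_exhausted()
s2 = task.on_checkpoint_result(2, succeeded=False)  # timed out or failed: keep waiting
s3 = task.on_checkpoint_result(3, succeeded=True)   # this one becomes the final checkpoint
```

In this model there is no special "final checkpoint" that can fail; the final one is just whichever checkpoint happens to succeed first after the task is done.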

This is a basic question, though: should we simply keep all tasks (subtasks) around forever until the whole graph shuts down? Our answer for this was *no*, so far. We would like to allow tasks to shut down, such that the resources are freed at that point.

Best,
Aljoscha
