Re: [DISCUSS] FLIP-147: Support Checkpoints After Tasks Finished

Aljoscha Krettek Wed, 06 Jan 2021 07:30:14 -0800

On 2021/01/06 13:35, Arvid Heise wrote:

I was actually not thinking about concurrent checkpoints (and actually want
to get rid of them once UC is established, since they are addressing the
same thing).

I would give a yuge +1 to that. I don't see why we would need concurrent
checkpoints in most cases. (Any case even?)

However, I have the impression that you think mostly in terms of tasks and
I mostly think in terms of subtasks. I especially want to have proper
support for bounded sources where one partition is much larger than the
other partitions (might be in conjunction with unbounded sources such that
checkpointing is plausible to begin with). Hence, most of the subtasks are
finished with one struggler remaining. In this case, the barriers are
inserted now only in the struggling source subtask and potentially in any
running downstream subtask.
As far as I have understood, this would require barriers to be inserted
downstream leading to similar race conditions.

No, I'm also thinking in terms of subtasks when it comes to triggering.
As long as a subtask has at least one upstream task we don't need to
manually trigger that task. A task will know which of its inputs have
finished, so it will take those out of the calculation that waits for
barriers from all upstream tasks. In the case where only a single
upstream source is remaining the barriers from that task will then
trigger checkpointing at the downstream task.
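
To make the alignment rule above concrete, here is a minimal sketch of a
downstream subtask that drops finished inputs from the set of channels it
waits on. The class and method names (BarrierAligner, on_barrier,
on_input_finished) are hypothetical and chosen for illustration; this is
not Flink's actual barrier handling code.

```python
class BarrierAligner:
    """Toy model of barrier alignment for one downstream subtask.

    Hypothetical illustration only, not Flink's implementation.
    """

    def __init__(self, input_channels):
        # Inputs that are still running and can still deliver barriers.
        self.active = set(input_channels)
        # checkpoint id -> channels whose barrier has already arrived.
        self.seen = {}

    def on_barrier(self, channel, checkpoint_id):
        """Record a barrier; return checkpoint ids that are now aligned."""
        self.seen.setdefault(checkpoint_id, set()).add(channel)
        return self._aligned()

    def on_input_finished(self, channel):
        """A finished input is no longer waited on for any checkpoint."""
        self.active.discard(channel)
        return self._aligned()

    def _aligned(self):
        # With no running input left, no barrier can arrive, so nothing
        # is triggered from upstream in this sketch.
        if not self.active:
            return []
        ready = sorted(
            cid for cid, channels in self.seen.items()
            if self.active <= channels
        )
        for cid in ready:
            del self.seen[cid]
        return ready


aligner = BarrierAligner(["src-0", "src-1", "src-2"])
aligner.on_barrier("src-2", 1)               # still waiting on src-0, src-1
aligner.on_input_finished("src-0")
ready = aligner.on_input_finished("src-1")   # only src-2 remains
```

Once src-0 and src-1 have finished, the barrier already received from the
one remaining source is enough to trigger checkpoint 1 at this subtask.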

I'm also concerned about the notion of a final checkpoint. What happens
when this final checkpoint times out (checkpoint timeout > async timeout)
or fails for a different reason? I'm currently more inclined to just let
checkpoints work until the whole graph is completed (and thought this was
the initial goal of the whole FLIP to begin with). However, that would
require subtasks to stay alive until they receive checkpointCompleted
callback (which is currently also not guaranteed)...

The idea is that the final checkpoint is whatever checkpoint succeeds in
the end. When a task (and I mostly mean subtask when I say task) knows
that it is done it waits for the next successful checkpoint and then
shuts down.
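
As a rough sketch of that lifecycle: a finished subtask stays around
through failed or timed-out checkpoints and only shuts down on the first
one that completes. The names below (Subtask, State, the on_* callbacks)
are hypothetical, and the sketch assumes that any checkpoint completing
after the subtask is done counts as "the next successful checkpoint".

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    WAITING_FOR_CHECKPOINT = "waiting_for_checkpoint"
    SHUT_DOWN = "shut_down"


class Subtask:
    """Toy lifecycle of one subtask; hypothetical, not Flink's API."""

    def __init__(self):
        self.state = State.RUNNING
        self.final_checkpoint_id = None

    def on_input_exhausted(self):
        # The subtask knows it is done but must not release resources yet.
        if self.state is State.RUNNING:
            self.state = State.WAITING_FOR_CHECKPOINT

    def on_checkpoint_aborted(self, checkpoint_id):
        # A failed or timed-out checkpoint changes nothing: keep waiting.
        pass

    def on_checkpoint_completed(self, checkpoint_id):
        # The first checkpoint to succeed after finishing becomes the
        # "final" checkpoint for this subtask, which then shuts down.
        if self.state is State.WAITING_FOR_CHECKPOINT:
            self.final_checkpoint_id = checkpoint_id
            self.state = State.SHUT_DOWN


task = Subtask()
task.on_input_exhausted()
task.on_checkpoint_aborted(7)     # timeout or failure: still waiting
task.on_checkpoint_completed(8)   # success: subtask can shut down
```

In this model there is no distinguished final checkpoint chosen up front,
so a timeout only delays shutdown until a later checkpoint succeeds.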

This is a basic question, though: should we simply keep all tasks
(subtasks) around forever until the whole graph shuts down? Our answer
for this was *no*, so far. We would like to allow tasks to shut down,
such that the resources are freed at that point.


Best,
Aljoscha
