Hey Gyula,

That's a very interesting idea. The discussion about `Individual` vs `Global` checkpoints was raised before, but the main concerns came from two aspects:
- Non-deterministic replaying may lead to an inconsistent view of the checkpoint.
- It is not easy to form a clear cut between past and future, and hence no clear cut of where the sources should begin replaying from.

Starting from independent subgraphs, as you proposed, may be a good starting point. However, when we talk about a subgraph, do we mean a job subgraph (each vertex is one or more operators) or an execution subgraph (each vertex is a task instance)? If it is a job subgraph, then indeed, why not separate it into multiple jobs, as Caizhi mentioned? If it is an execution subgraph, then rescaling is difficult to handle, due to inconsistent views of checkpoints between tasks of the same operator.

`Individual/Subgraph Checkpointing` is definitely an interesting direction to think about, and I'd love to hear more from you!

Best,
Yuan

On Mon, Feb 7, 2022 at 10:16 AM Caizhi Weng <tsreape...@gmail.com> wrote:

> Hi Gyula!
>
> Thanks for raising this discussion. I agree that this will be an
> interesting feature, but I actually have some doubts about the motivation
> and use case. If there are multiple individual subgraphs in the same job,
> why not just distribute them across multiple jobs, so that each job
> contains only one individual graph and can now fail without disturbing
> the others?
>
>
> On Mon, Feb 7, 2022 at 05:22 Gyula Fóra <gyf...@apache.org> wrote:
>
> > Hi all!
> >
> > At the moment, checkpointing only works for healthy jobs with all
> > running (or some finished) tasks. This sounds reasonable in most cases,
> > but there are a few applications where it would make sense to
> > checkpoint failing jobs as well.
> >
> > Due to how the checkpointing mechanism works, subgraphs that have a
> > failing task cannot be checkpointed without violating the exactly-once
> > semantics. However, if the job has multiple independent subgraphs (that
> > are not connected to each other), then even if one subgraph is failing,
> > the other, completely running ones could still be checkpointed. In
> > these cases the tasks of the failing subgraph could simply inherit the
> > last successful checkpoint metadata (from before they started failing).
> > This logic would produce a consistent checkpoint.
> >
> > The job as a whole could then make stateful progress even if some
> > subgraphs are constantly failing. This can be very valuable if, for
> > some reason, the job has a larger number of independent subgraphs that
> > are expected to fail every once in a while, or if some subgraphs can
> > have longer downtimes that would otherwise cause the whole job to
> > stall.
> >
> > What do you think?
> >
> > Cheers,
> > Gyula
> >
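To make the "inherit the last successful checkpoint metadata" idea from Gyula's quoted mail a bit more concrete, here is a minimal Java sketch. This is not Flink's actual CheckpointCoordinator code; every name in it (`StateHandle`, `assembleCheckpoint`, the subgraph ids) is hypothetical, and it only models state handles at the granularity of whole subgraphs:

```java
// A minimal sketch, not Flink's actual CheckpointCoordinator API; all names
// below (StateHandle, assembleCheckpoint, subgraph ids) are hypothetical.
import java.util.HashMap;
import java.util.Map;

class SubgraphCheckpointSketch {

    /** Opaque pointer to a subgraph's persisted state for one checkpoint. */
    record StateHandle(String subgraphId, long checkpointId) {}

    /**
     * Assembles a global checkpoint: healthy subgraphs contribute freshly
     * acknowledged state, while failing subgraphs inherit the handle they
     * acknowledged in the last completed checkpoint.
     */
    static Map<String, StateHandle> assembleCheckpoint(
            Map<String, StateHandle> freshAcks,        // acks from healthy subgraphs
            Map<String, StateHandle> lastCompleted) {  // previous completed checkpoint

        Map<String, StateHandle> result = new HashMap<>();
        for (var entry : lastCompleted.entrySet()) {
            // Take the new handle if the subgraph acknowledged this checkpoint,
            // otherwise carry its old handle forward unchanged.
            result.put(entry.getKey(),
                    freshAcks.getOrDefault(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, StateHandle> last = Map.of(
                "subgraphA", new StateHandle("subgraphA", 41),
                "subgraphB", new StateHandle("subgraphB", 41));
        // subgraphB is failing and never acknowledges checkpoint 42.
        Map<String, StateHandle> fresh = Map.of(
                "subgraphA", new StateHandle("subgraphA", 42));
        System.out.println(assembleCheckpoint(fresh, last));
        // subgraphA now points at checkpoint 42; subgraphB stays at 41.
    }
}
```

The key property is that mixing checkpoint epochs like this is only consistent because the subgraphs share no channels; if the two subgraphs exchanged records, combining state from different checkpoints would violate exactly-once, which is exactly the concern raised above for execution subgraphs.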