Re: [RESULT][VOTE] FLIP-147: Support Checkpoint After Tasks Finished

Till Rohrmann Thu, 15 Jul 2021 01:15:27 -0700

Thanks a lot for your answers and clarifications Yun.

1+2) Agreed, this can be a future improvement if this becomes a problem.


3) Great, this will help a lot with understanding the FLIP.

Cheers,
Till

On Wed, Jul 14, 2021 at 5:41 PM Yun Gao <yungao...@aliyun.com.invalid>
wrote:

> Hi Till,
>
> Very thanks for the review and comments!
>
> 1) First I think in fact we could be able to do the computation outside of
> the main thread,
> and the current implementation mainly due to the computation is in general
> fast and we
> initially want to have a simplified first version.
>
> The main requirement here is to have a constant view of the state of the
> tasks, otherwise
> for example if we have A -> B, if A is running when we check if we need to
> trigger A, we will
> mark A as have to trigger, but if A gets to finished when we check B, we
> will also mark B as
> have to trigger, then B will receive both rpc trigger and checkpoint
> barrier, which would break
> some assumption on the task side and complicate the implementation.
>
> But to cope this issue, we in fact could first have a snapshot of the
> tasks' state and then do the
> computation, both the two step do not need to be in the main thread.
>
> 2) For the computation logic, in fact currently we benefit a lot from some
> shortcuts on all-to-all
> edges and job vertex with all tasks running, these shortcuts could do
> checks on the job vertex level
> first and skip some job vertices as a whole. With this optimization we
> have a O(V) algorithm, and the
> current running time of the worst case for a job with 320,000 tasks is
> less than 100ms. For
> daily graph sizes the time would be further reduced linearly.
>
> If we do the computation based on the last triggered tasks, we may not
> easily encode this information
> into the shortcuts on the job vertex level. And since the time seems to be
> short, perhaps it is enough
> to do re-computation from the scratch in consideration of the tradeoff
> between the performance and
> the complexity ?
>
> 3) We are going to emit the EndOfInput event exactly after the finish()
> method and before the last
> snapshotState() method so that we could shut down the whole topology with
> a single final checkpoint.
> Very sorry for not include enough details for this part and I'll
> complement the FLIP with the details on
> the process of the final checkpoint / savepoint.
>
> Best,
> Yun
>
>
>
> ------------------------------------------------------------------
> From:Till Rohrmann <trohrm...@apache.org>
> Send Time:2021 Jul. 14 (Wed.) 22:05
> To:dev <dev@flink.apache.org>
> Subject:Re: [RESULT][VOTE] FLIP-147: Support Checkpoint After Tasks
> Finished
>
> Hi everyone,
>
> I am a bit late to the voting party but let me ask three questions:
>
> 1) Why do we execute the trigger plan computation in the main thread if we
> cannot guarantee that all tasks are still running when triggering the
> checkpoint? Couldn't we do the computation in a different thread in order
> to relieve the main thread a bit.
>
> 2) The implementation of the DefaultCheckpointPlanCalculator seems to go
> over the whole topology for every calculation. Wouldn't it be more
> efficient to maintain the set of current tasks to trigger and check whether
> anything has changed and if so check the succeeding tasks until we have
> found the current checkpoint trigger frontier?
>
> 3) When are we going to send the endOfInput events to a downstream task? If
> this happens after we call finish on the upstream operator but before
> snapshotState then it would be possible to shut down the whole topology
> with a single final checkpoint. I think this part could benefit from a bit
> more detailed description in the FLIP.
>
> Cheers,
> Till
>
> On Fri, Jul 2, 2021 at 8:36 AM Yun Gao <yungao...@aliyun.com.invalid>
> wrote:
>
> > Hi there,
> >
> > Since the voting time of FLIP-147[1] has passed, I'm closing the vote
> now.
> >
> > There were seven +1 votes ( 6 / 7 are bindings) and no -1 votes:
> >
> > - Dawid Wysakowicz (binding)
> > - Piotr Nowojski(binding)
> > - Jiangang Liu (binding)
> > - Arvid Heise (binding)
> > - Jing Zhang (binding)
> > - Leonard Xu (non-binding)
> > - Guowei Ma (binding)
> >
> > Thus I'm happy to announce that the update to the FLIP-147 is accepted.
> >
> > Very thanks everyone!
> >
> > Best,
> > Yun
> >
> > [1]  https://cwiki.apache.org/confluence/x/mw-ZCQ
>
>

Re: [RESULT][VOTE] FLIP-147: Support Checkpoint After Tasks Finished

Reply via email to