Thanks a lot for your answers and clarifications Yun. 1+2) Agreed, this can be a future improvement if this becomes a problem.
3) Great, this will help a lot with understanding the FLIP. Cheers, Till On Wed, Jul 14, 2021 at 5:41 PM Yun Gao <yungao...@aliyun.com.invalid> wrote: > Hi Till, > > Very thanks for the review and comments! > > 1) First I think in fact we could be able to do the computation outside of > the main thread, > and the current implementation mainly due to the computation is in general > fast and we > initially want to have a simplified first version. > > The main requirement here is to have a constant view of the state of the > tasks, otherwise > for example if we have A -> B, if A is running when we check if we need to > trigger A, we will > mark A as have to trigger, but if A gets to finished when we check B, we > will also mark B as > have to trigger, then B will receive both rpc trigger and checkpoint > barrier, which would break > some assumption on the task side and complicate the implementation. > > But to cope this issue, we in fact could first have a snapshot of the > tasks' state and then do the > computation, both the two step do not need to be in the main thread. > > 2) For the computation logic, in fact currently we benefit a lot from some > shortcuts on all-to-all > edges and job vertex with all tasks running, these shortcuts could do > checks on the job vertex level > first and skip some job vertices as a whole. With this optimization we > have a O(V) algorithm, and the > current running time of the worst case for a job with 320,000 tasks is > less than 100ms. For > daily graph sizes the time would be further reduced linearly. > > If we do the computation based on the last triggered tasks, we may not > easily encode this information > into the shortcuts on the job vertex level. And since the time seems to be > short, perhaps it is enough > to do re-computation from the scratch in consideration of the tradeoff > between the performance and > the complexity ? > > 3) We are going to emit the EndOfInput event exactly after the finish() > method and before the last > snapshotState() method so that we could shut down the whole topology with > a single final checkpoint. > Very sorry for not include enough details for this part and I'll > complement the FLIP with the details on > the process of the final checkpoint / savepoint. > > Best, > Yun > > > > ------------------------------------------------------------------ > From:Till Rohrmann <trohrm...@apache.org> > Send Time:2021 Jul. 14 (Wed.) 22:05 > To:dev <dev@flink.apache.org> > Subject:Re: [RESULT][VOTE] FLIP-147: Support Checkpoint After Tasks > Finished > > Hi everyone, > > I am a bit late to the voting party but let me ask three questions: > > 1) Why do we execute the trigger plan computation in the main thread if we > cannot guarantee that all tasks are still running when triggering the > checkpoint? Couldn't we do the computation in a different thread in order > to relieve the main thread a bit. > > 2) The implementation of the DefaultCheckpointPlanCalculator seems to go > over the whole topology for every calculation. Wouldn't it be more > efficient to maintain the set of current tasks to trigger and check whether > anything has changed and if so check the succeeding tasks until we have > found the current checkpoint trigger frontier? > > 3) When are we going to send the endOfInput events to a downstream task? If > this happens after we call finish on the upstream operator but before > snapshotState then it would be possible to shut down the whole topology > with a single final checkpoint. I think this part could benefit from a bit > more detailed description in the FLIP. > > Cheers, > Till > > On Fri, Jul 2, 2021 at 8:36 AM Yun Gao <yungao...@aliyun.com.invalid> > wrote: > > > Hi there, > > > > Since the voting time of FLIP-147[1] has passed, I'm closing the vote > now. > > > > There were seven +1 votes ( 6 / 7 are bindings) and no -1 votes: > > > > - Dawid Wysakowicz (binding) > > - Piotr Nowojski(binding) > > - Jiangang Liu (binding) > > - Arvid Heise (binding) > > - Jing Zhang (binding) > > - Leonard Xu (non-binding) > > - Guowei Ma (binding) > > > > Thus I'm happy to announce that the update to the FLIP-147 is accepted. > > > > Very thanks everyone! > > > > Best, > > Yun > > > > [1] https://cwiki.apache.org/confluence/x/mw-ZCQ > >