Thanks for updating the FLIP. Based on the new section about stop-with-savepoint [--drain] I got two other questions:
1) Does endOfInput entail sending of the MAX_WATERMARK? 2) StreamOperator.finish says to flush all buffered events. Would a WindowOperator close all windows and emit the results upon calling finish, for example? Cheers, Till On Thu, Jul 15, 2021 at 10:15 AM Till Rohrmann <trohrm...@apache.org> wrote: > Thanks a lot for your answers and clarifications Yun. > > 1+2) Agreed, this can be a future improvement if this becomes a problem. > > 3) Great, this will help a lot with understanding the FLIP. > > Cheers, > Till > > On Wed, Jul 14, 2021 at 5:41 PM Yun Gao <yungao...@aliyun.com.invalid> > wrote: > >> Hi Till, >> >> Very thanks for the review and comments! >> >> 1) First I think in fact we could be able to do the computation outside >> of the main thread, >> and the current implementation mainly due to the computation is in >> general fast and we >> initially want to have a simplified first version. >> >> The main requirement here is to have a constant view of the state of the >> tasks, otherwise >> for example if we have A -> B, if A is running when we check if we need >> to trigger A, we will >> mark A as have to trigger, but if A gets to finished when we check B, we >> will also mark B as >> have to trigger, then B will receive both rpc trigger and checkpoint >> barrier, which would break >> some assumption on the task side and complicate the implementation. >> >> But to cope this issue, we in fact could first have a snapshot of the >> tasks' state and then do the >> computation, both the two step do not need to be in the main thread. >> >> 2) For the computation logic, in fact currently we benefit a lot from >> some shortcuts on all-to-all >> edges and job vertex with all tasks running, these shortcuts could do >> checks on the job vertex level >> first and skip some job vertices as a whole. With this optimization we >> have a O(V) algorithm, and the >> current running time of the worst case for a job with 320,000 tasks is >> less than 100ms. For >> daily graph sizes the time would be further reduced linearly. >> >> If we do the computation based on the last triggered tasks, we may not >> easily encode this information >> into the shortcuts on the job vertex level. And since the time seems to >> be short, perhaps it is enough >> to do re-computation from the scratch in consideration of the tradeoff >> between the performance and >> the complexity ? >> >> 3) We are going to emit the EndOfInput event exactly after the finish() >> method and before the last >> snapshotState() method so that we could shut down the whole topology with >> a single final checkpoint. >> Very sorry for not include enough details for this part and I'll >> complement the FLIP with the details on >> the process of the final checkpoint / savepoint. >> >> Best, >> Yun >> >> >> >> ------------------------------------------------------------------ >> From:Till Rohrmann <trohrm...@apache.org> >> Send Time:2021 Jul. 14 (Wed.) 22:05 >> To:dev <dev@flink.apache.org> >> Subject:Re: [RESULT][VOTE] FLIP-147: Support Checkpoint After Tasks >> Finished >> >> Hi everyone, >> >> I am a bit late to the voting party but let me ask three questions: >> >> 1) Why do we execute the trigger plan computation in the main thread if we >> cannot guarantee that all tasks are still running when triggering the >> checkpoint? Couldn't we do the computation in a different thread in order >> to relieve the main thread a bit. >> >> 2) The implementation of the DefaultCheckpointPlanCalculator seems to go >> over the whole topology for every calculation. Wouldn't it be more >> efficient to maintain the set of current tasks to trigger and check >> whether >> anything has changed and if so check the succeeding tasks until we have >> found the current checkpoint trigger frontier? >> >> 3) When are we going to send the endOfInput events to a downstream task? >> If >> this happens after we call finish on the upstream operator but before >> snapshotState then it would be possible to shut down the whole topology >> with a single final checkpoint. I think this part could benefit from a bit >> more detailed description in the FLIP. >> >> Cheers, >> Till >> >> On Fri, Jul 2, 2021 at 8:36 AM Yun Gao <yungao...@aliyun.com.invalid> >> wrote: >> >> > Hi there, >> > >> > Since the voting time of FLIP-147[1] has passed, I'm closing the vote >> now. >> > >> > There were seven +1 votes ( 6 / 7 are bindings) and no -1 votes: >> > >> > - Dawid Wysakowicz (binding) >> > - Piotr Nowojski(binding) >> > - Jiangang Liu (binding) >> > - Arvid Heise (binding) >> > - Jing Zhang (binding) >> > - Leonard Xu (non-binding) >> > - Guowei Ma (binding) >> > >> > Thus I'm happy to announce that the update to the FLIP-147 is accepted. >> > >> > Very thanks everyone! >> > >> > Best, >> > Yun >> > >> > [1] https://cwiki.apache.org/confluence/x/mw-ZCQ >> >>