Re: [RESULT][VOTE] FLIP-147: Support Checkpoint After Tasks Finished

Till Rohrmann Thu, 15 Jul 2021 07:47:10 -0700

Thanks for updating the FLIP. Based on the new section about
stop-with-savepoint [--drain] I got two other questions:


1) Does endOfInput entail sending of the MAX_WATERMARK?

2) StreamOperator.finish says to flush all buffered events. Would a
WindowOperator close all windows and emit the results upon calling
finish, for example?

Cheers,
Till

On Thu, Jul 15, 2021 at 10:15 AM Till Rohrmann <trohrm...@apache.org> wrote:

> Thanks a lot for your answers and clarifications Yun.
>
> 1+2) Agreed, this can be a future improvement if this becomes a problem.
>
> 3) Great, this will help a lot with understanding the FLIP.
>
> Cheers,
> Till
>
> On Wed, Jul 14, 2021 at 5:41 PM Yun Gao <yungao...@aliyun.com.invalid>
> wrote:
>
>> Hi Till,
>>
>> Very thanks for the review and comments!
>>
>> 1) First I think in fact we could be able to do the computation outside
>> of the main thread,
>> and the current implementation mainly due to the computation is in
>> general fast and we
>> initially want to have a simplified first version.
>>
>> The main requirement here is to have a constant view of the state of the
>> tasks, otherwise
>> for example if we have A -> B, if A is running when we check if we need
>> to trigger A, we will
>> mark A as have to trigger, but if A gets to finished when we check B, we
>> will also mark B as
>> have to trigger, then B will receive both rpc trigger and checkpoint
>> barrier, which would break
>> some assumption on the task side and complicate the implementation.
>>
>> But to cope this issue, we in fact could first have a snapshot of the
>> tasks' state and then do the
>> computation, both the two step do not need to be in the main thread.
>>
>> 2) For the computation logic, in fact currently we benefit a lot from
>> some shortcuts on all-to-all
>> edges and job vertex with all tasks running, these shortcuts could do
>> checks on the job vertex level
>> first and skip some job vertices as a whole. With this optimization we
>> have a O(V) algorithm, and the
>> current running time of the worst case for a job with 320,000 tasks is
>> less than 100ms. For
>> daily graph sizes the time would be further reduced linearly.
>>
>> If we do the computation based on the last triggered tasks, we may not
>> easily encode this information
>> into the shortcuts on the job vertex level. And since the time seems to
>> be short, perhaps it is enough
>> to do re-computation from the scratch in consideration of the tradeoff
>> between the performance and
>> the complexity ?
>>
>> 3) We are going to emit the EndOfInput event exactly after the finish()
>> method and before the last
>> snapshotState() method so that we could shut down the whole topology with
>> a single final checkpoint.
>> Very sorry for not include enough details for this part and I'll
>> complement the FLIP with the details on
>> the process of the final checkpoint / savepoint.
>>
>> Best,
>> Yun
>>
>>
>>
>> ------------------------------------------------------------------
>> From:Till Rohrmann <trohrm...@apache.org>
>> Send Time:2021 Jul. 14 (Wed.) 22:05
>> To:dev <dev@flink.apache.org>
>> Subject:Re: [RESULT][VOTE] FLIP-147: Support Checkpoint After Tasks
>> Finished
>>
>> Hi everyone,
>>
>> I am a bit late to the voting party but let me ask three questions:
>>
>> 1) Why do we execute the trigger plan computation in the main thread if we
>> cannot guarantee that all tasks are still running when triggering the
>> checkpoint? Couldn't we do the computation in a different thread in order
>> to relieve the main thread a bit.
>>
>> 2) The implementation of the DefaultCheckpointPlanCalculator seems to go
>> over the whole topology for every calculation. Wouldn't it be more
>> efficient to maintain the set of current tasks to trigger and check
>> whether
>> anything has changed and if so check the succeeding tasks until we have
>> found the current checkpoint trigger frontier?
>>
>> 3) When are we going to send the endOfInput events to a downstream task?
>> If
>> this happens after we call finish on the upstream operator but before
>> snapshotState then it would be possible to shut down the whole topology
>> with a single final checkpoint. I think this part could benefit from a bit
>> more detailed description in the FLIP.
>>
>> Cheers,
>> Till
>>
>> On Fri, Jul 2, 2021 at 8:36 AM Yun Gao <yungao...@aliyun.com.invalid>
>> wrote:
>>
>> > Hi there,
>> >
>> > Since the voting time of FLIP-147[1] has passed, I'm closing the vote
>> now.
>> >
>> > There were seven +1 votes ( 6 / 7 are bindings) and no -1 votes:
>> >
>> > - Dawid Wysakowicz (binding)
>> > - Piotr Nowojski(binding)
>> > - Jiangang Liu (binding)
>> > - Arvid Heise (binding)
>> > - Jing Zhang (binding)
>> > - Leonard Xu (non-binding)
>> > - Guowei Ma (binding)
>> >
>> > Thus I'm happy to announce that the update to the FLIP-147 is accepted.
>> >
>> > Very thanks everyone!
>> >
>> > Best,
>> > Yun
>> >
>> > [1]  https://cwiki.apache.org/confluence/x/mw-ZCQ
>>
>>

Re: [RESULT][VOTE] FLIP-147: Support Checkpoint After Tasks Finished

Reply via email to