Hi Arvid,

+1 for this feature, which has been hoped for for a long time. End-to-end exactly-once jobs could benefit from quicker checkpoint completion.

Best
Yun Tang

________________________________
From: Yun Gao <yungao...@aliyun.com.INVALID>
Sent: Thursday, October 10, 2019 18:39
To: dev <dev@flink.apache.org>
Subject: Re: [DISCUSS] FLIP-76: Unaligned checkpoints

Hi Arvid,

Many thanks for bringing up the discussion! On our side, checkpoints that fail to complete are a common problem for online jobs, therefore +1 from my side to implement this.

A tiny issue with the FLIP is that the attached Discussion Thread URL does not seem to be right.

Best,
Yun

------------------------------------------------------------------
From:Arvid Heise <ar...@ververica.com>
Send Time:2019 Sep. 30 (Mon.) 20:31
To:dev <dev@flink.apache.org>
Subject:[DISCUSS] FLIP-76: Unaligned checkpoints

Hi Devs,

I would like to start the formal discussion about FLIP-76 [1], which improves the checkpoint latency in systems under backpressure, where a checkpoint can take hours to complete in the worst case. I recommend the thread "checkpointing under backpressure" [2] to get a good idea of why users are not satisfied with the current behavior. The key points:

- Since the checkpoint barrier flows much more slowly through the back-pressured channels, the other channels and their upstream operators are effectively blocked during checkpointing.
- The checkpoint barrier takes a long time to reach the sinks, causing long checkpointing times. A longer checkpointing time in turn means that the checkpoint will be fairly outdated once done. Since a heavily utilized pipeline is inherently more fragile, we may run into a vicious cycle of late checkpoints, crash, recovery to a rather outdated checkpoint, more back pressure, and even later checkpoints, which would result in little to no progress in the application.

The FLIP proposes "unaligned checkpoints", which improve on the current state such that

- Upstream processes can continue to produce data, even if some operator is still waiting for a checkpoint barrier on a specific input channel.
- Checkpointing times are heavily reduced across the execution graph, even for operators with a single input channel.
- End-users will see more progress even in unstable environments, as more up-to-date checkpoints will avoid too many recomputations.
- Faster rescaling is facilitated.

The key idea is to allow checkpoint barriers to be forwarded to downstream tasks before the synchronous part of the checkpointing has been conducted (see Fig. 1). To that end, we need to store in-flight data as part of the checkpoint, as described in greater detail in this FLIP.

Although the basic idea was already sketched in [2], we would like to get broader feedback in this dedicated mail thread.

Best,

Arvid

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-76%3A+Unaligned+Checkpoints
[2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Checkpointing-under-backpressure-td31616.html
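
To make the key idea above more concrete, here is a minimal toy sketch in Python. It is not Flink code and not the design specified in the FLIP. It only contrasts an operator that waits for the barrier on every input channel with one that forwards the barrier as soon as the first one arrives and stores the overtaken records as in-flight data. All names (`BARRIER`, `aligned`, `unaligned`) are hypothetical, and the sketch assumes exactly one barrier per channel, round-robin reads, and in-flight data on the input side only.

```python
from collections import deque

BARRIER = "BARRIER"  # hypothetical marker standing in for a checkpoint barrier


def aligned(channels):
    """Toy aligned checkpoint: a channel is blocked once its barrier arrives,
    and the barrier is forwarded only after every channel has delivered one.
    Assumes exactly one barrier per channel."""
    queues = [deque(c) for c in channels]
    blocked = set()
    output = []
    while any(queues):
        for i, q in enumerate(queues):
            if i in blocked or not q:
                continue
            item = q.popleft()
            if item == BARRIER:
                blocked.add(i)
                if len(blocked) == len(queues):
                    output.append(BARRIER)  # all inputs aligned: forward barrier
                    blocked.clear()
            else:
                output.append(item)
    return output, []  # nothing in flight needs to be stored


def unaligned(channels):
    """Toy unaligned checkpoint: the first barrier is forwarded immediately;
    records still queued ahead of the barrier on the other channels are
    stored as in-flight data and processed afterwards."""
    queues = [deque(c) for c in channels]
    output = []
    in_flight = []
    forwarded = False
    while any(queues):
        for i, q in enumerate(queues):
            if not q:
                continue
            item = q.popleft()
            if item != BARRIER:
                output.append(item)
            elif not forwarded:
                for j, other in enumerate(queues):
                    if j == i:
                        continue
                    for record in other:
                        if record == BARRIER:
                            break
                        in_flight.append(record)  # overtaken by the barrier
                output.append(BARRIER)
                forwarded = True
            # barriers arriving later on other channels are simply dropped
    return output, in_flight


# Channel "a" delivers its barrier early, channel "b" lags behind.
channels = [["a1", BARRIER, "a2"], ["b1", "b2", "b3", BARRIER, "b4"]]
print(aligned(channels))
print(unaligned(channels))
```

In this toy run the aligned variant emits the barrier only after `b3`, while the unaligned variant emits it right after `b1` and records `b2` and `b3` as in-flight data, which is the trade-off between earlier barriers and larger checkpoints that the proposal describes.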