Re: [DISCUSS] Checkpoint Failure process improvement

vino yang Tue, 15 Jan 2019 04:03:09 -0800

Hi all,

I am going to start coding based on the design document. Feedback is
welcome at any point in the process.


Best,
Vino

vino yang <yanghua1...@gmail.com> wrote on Wed, Jan 9, 2019 at 12:29 AM:

> Hi all,
>
>
> Currently, the checkpoint failure handling logic is somewhat confusing
> (unfocused), which makes some functions in the existing code passive.
>
> So I have written a design document that describes how to improve the
> checkpoint failure handling logic and make it clearer.
>
> Based on this, we introduce a CheckpointFailureManager, which makes
> checkpoint failure processing more flexible.
>
> This is mainly motivated by the following issues:
>
>    - FLINK-4810[1]: Checkpoint Coordinator should fail ExecutionGraph after
>      "n" unsuccessful checkpoints
>    - FLINK-10074[3]: Allowable number of checkpoint failure
>    - FLINK-10724[2]: Refactor failure handling in checkpoint coordinator
>
> https://docs.google.com/document/d/1ce7RtecuTxcVUJlnU44hzcO2Dwq9g4Oyd8_biy94hJc/edit?usp=sharing
>
> *Thanks to @Andrey Zagrebin for helping me review the document and
> suggesting many improvements.*
>
> Feedback and comments are very welcome!
>
> Best,
> Vino
>
> [1]: https://issues.apache.org/jira/browse/FLINK-4810
> [2]: https://issues.apache.org/jira/browse/FLINK-10724
> [3]: https://issues.apache.org/jira/browse/FLINK-10074
>
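
To make the idea behind FLINK-4810 and FLINK-10074 concrete, here is a rough
sketch of how a failure manager could fail the job after "n" unsuccessful
checkpoints. It is only an illustration: the class name, the callback
interface, the method names, and the choice to count consecutive failures
(resetting on success) are all assumptions, not taken from the design
document or from existing Flink code.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of a checkpoint failure manager; not Flink's API. */
final class CheckpointFailureManagerSketch {

    /** Assumed callback that fails the job (for example, the ExecutionGraph). */
    interface FailJobCallback {
        void failJob(Throwable cause);
    }

    private final int tolerableFailures;
    private final FailJobCallback callback;
    private final AtomicInteger consecutiveFailures = new AtomicInteger(0);

    CheckpointFailureManagerSketch(int tolerableFailures, FailJobCallback callback) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        if (callback == null) {
            throw new IllegalArgumentException("callback must not be null");
        }
        this.tolerableFailures = tolerableFailures;
        this.callback = callback;
    }

    /** Counts a failed checkpoint; fails the job once the limit is exceeded. */
    void handleCheckpointFailure(Throwable cause) {
        if (consecutiveFailures.incrementAndGet() > tolerableFailures) {
            callback.failJob(cause);
        }
    }

    /** Assumption: a successful checkpoint resets the failure counter. */
    void handleCheckpointSuccess() {
        consecutiveFailures.set(0);
    }

    int getConsecutiveFailures() {
        return consecutiveFailures.get();
    }
}
```

With a limit of 2, the first two consecutive failures would be tolerated and
the third would invoke the callback. Whether failures should be counted
consecutively or in total is one of the points the design has to settle.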
