[ https://issues.apache.org/jira/browse/FLINK-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848170#comment-16848170 ]
vinoyang commented on FLINK-8871:
---------------------------------

Hi [~srichter],

First of all, there has been some off-topic discussion on this issue, and I am one of the participants. I am sorry about that, and we should do our best not to let it happen again.

Thank you for helping to sort out the relationships and priorities between these issues. I basically agree with you; there is only one point I would like to revisit. I am not denying the importance of FLINK-11662, but it seems it could proceed in parallel with the checkpoint failure processing work. Although FLINK-11662 can also cause the job to fail, it does not affect the semantics of {{CheckpointFailureManager}} (a job failure and recovery simply cause the {{CheckpointFailureManager}} to start working again). And I think the three steps listed below will go a long way toward simplifying the checkpoint-related code, and other work will benefit as well.

I am currently implementing a refactoring of checkpoint failure processing, so let me give more details here. Initially it was designed to support more flexible failure-tolerance conditions, such as tolerating n consecutive checkpoint failures. Over time it grew into a larger theme and was divided into three parts:

* FLINK-10724: Refactor the checkpoint-related exception classes: no longer distinguish between the trigger phase and the execution phase, introduce a unified exception class {{CheckpointException}}, enumerate all possible failure causes in {{CheckpointFailureReason}}, and refactor {{PendingCheckpoint#abortXXX}}.
In fact, when I implemented [PR-8322|https://github.com/apache/flink/pull/8322], I found that I had also missed a number of decline-related exceptions; I hope this will be completed in the second step.
* FLINK-12364: Introduce a checkpoint failure manager: we simplified and refactored many configuration options and introduced a simple counting mechanism. Currently there are two problems with this [PR|https://github.com/apache/flink/pull/8322]: (1) it is not yet compatible with failTaskOnCheckpointError, which made me realize that the first step's work is not completely finished; (2) the failJob mechanism may be problematic and needs further confirmation. In addition, I believe more designs will benefit from {{CheckpointFailureManager}} in the future: it can take on more responsibility and separate more logic out of the {{CheckpointCoordinator}}, as mentioned in my previous comments.
* FLINK-12209: The TM is no longer allowed to explicitly fail the job based on the checkpoint configuration. This step is blocked by the second step, which will allow us to discard the failTaskOnCheckpointError option. From then on, the TM side will not be able to explicitly fail the job based on this configuration. This will benefit other issues.

Without a doubt, I agree with the importance of FLINK-11662. My question is whether we can reconsider processing them in parallel, because the refactoring is currently at an intermediate stage (and the second step will continue to handle the first step's remaining work). In fact, we have done a similar job to what [~yunta]'s team has done, as mentioned in a previous comment. I am also still discussing the other two checkpoint issues with Gyula. In view of some of the friction this issue has caused, I will not fight for this task. In addition, I admit that with so many issues in flight I have no time to pay attention to FLINK-11662; maybe [~yunta] has time to deal with it.
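To make the "simple counting mechanism" from the second step concrete, here is a minimal, hypothetical sketch (class and method names are illustrative, not Flink's actual API) of a counter that tolerates up to n consecutive checkpoint failures before signalling that the job should be failed:

```java
// Hypothetical sketch, not Flink's actual implementation: a failure manager
// that fails the job only after N consecutive checkpoint failures.
public class CheckpointFailureCounter {
    private final int tolerableConsecutiveFailures; // illustrative config knob
    private int consecutiveFailures = 0;

    public CheckpointFailureCounter(int tolerableConsecutiveFailures) {
        this.tolerableConsecutiveFailures = tolerableConsecutiveFailures;
    }

    /** A successful checkpoint resets the failure streak. */
    public void onCheckpointSuccess() {
        consecutiveFailures = 0;
    }

    /**
     * Records a failure; returns true when the streak reaches the
     * tolerance, i.e. the manager should now fail the job.
     */
    public boolean onCheckpointFailure() {
        consecutiveFailures++;
        return consecutiveFailures >= tolerableConsecutiveFailures;
    }

    public static void main(String[] args) {
        CheckpointFailureCounter counter = new CheckpointFailureCounter(3);
        System.out.println(counter.onCheckpointFailure()); // false (1 of 3)
        System.out.println(counter.onCheckpointFailure()); // false (2 of 3)
        counter.onCheckpointSuccess();                     // streak reset
        System.out.println(counter.onCheckpointFailure()); // false (1 of 3)
    }
}
```

The key design point is that only *consecutive* failures count: any successful checkpoint resets the streak, so transient failures do not accumulate toward failing the job.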
I will take the time to give more details of the checkpoint cleaner as soon as possible. I think we all agree on one point: FLINK-10855 will inevitably be blocked by other work, so its priority is not currently the highest.

> Checkpoint cancellation is not propagated to stop checkpointing threads on
> the task manager
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-8871
>                 URL: https://issues.apache.org/jira/browse/FLINK-8871
>             Project: Flink
>          Issue Type: Bug
>      Components: Runtime / Checkpointing
>    Affects Versions: 1.3.2, 1.4.1, 1.5.0, 1.6.0, 1.7.0
>            Reporter: Stefan Richter
>            Priority: Critical
>
> Flink currently lacks any form of feedback mechanism from the job manager /
> checkpoint coordinator to the tasks when it comes to failing a checkpoint.
> This means that running snapshots on the tasks are not stopped even if
> their owning checkpoint has already been cancelled. Two examples of cases where
> this applies are checkpoint timeouts and local checkpoint failures on a task
> together with a configuration that does not fail tasks on checkpoint failure.
> Notice that those running snapshots no longer count toward the maximum
> number of parallel checkpoints, because their owning checkpoint is considered
> cancelled.
> Not stopping the task's snapshot thread can lead to a problematic situation
> where the next checkpoints have already started while the abandoned snapshot
> thread from a previous checkpoint is still running. This scenario can
> potentially cascade: many parallel checkpoints will slow down checkpointing
> and make timeouts even more likely.
> A possible solution is to introduce a {{cancelCheckpoint}} method as the
> counterpart to the {{triggerCheckpoint}} method in the task manager gateway,
> which would be invoked by the checkpoint coordinator as part of cancelling
> the checkpoint.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
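The {{cancelCheckpoint}} / {{triggerCheckpoint}} pairing proposed in the quoted issue could be sketched roughly as follows. This is a deliberately simplified, hypothetical model (the class name, the use of a plain set instead of real snapshot threads, and the return types are all assumptions for illustration, not Flink's actual gateway API):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the proposed feedback path: alongside
// triggerCheckpoint, the TM-side gateway gains a cancelCheckpoint that the
// checkpoint coordinator can call to stop an in-flight snapshot.
public class TaskCheckpointGateway {
    // In a real TM these would be running snapshot threads; here a running
    // snapshot is modelled simply by its checkpoint id being in the set.
    private final Set<Long> runningSnapshots = new HashSet<>();

    /** Coordinator asks the task to start snapshot work for a checkpoint. */
    public void triggerCheckpoint(long checkpointId) {
        runningSnapshots.add(checkpointId);
    }

    /**
     * Proposed counterpart: the coordinator tells the task to stop the
     * snapshot for an already-cancelled checkpoint (e.g. after a timeout).
     * Returns true if a running snapshot was actually stopped.
     */
    public boolean cancelCheckpoint(long checkpointId) {
        return runningSnapshots.remove(checkpointId);
    }

    public boolean isSnapshotRunning(long checkpointId) {
        return runningSnapshots.contains(checkpointId);
    }

    public static void main(String[] args) {
        TaskCheckpointGateway gateway = new TaskCheckpointGateway();
        gateway.triggerCheckpoint(17L);
        System.out.println(gateway.isSnapshotRunning(17L)); // true
        gateway.cancelCheckpoint(17L); // e.g. checkpoint 17 timed out
        System.out.println(gateway.isSnapshotRunning(17L)); // false
    }
}
```

Without such a cancellation path, the abandoned snapshot keeps running even though the coordinator no longer counts it against the maximum number of parallel checkpoints, which is exactly the cascading-slowdown scenario the issue describes.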