[ https://issues.apache.org/jira/browse/FLINK-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845631#comment-16845631 ]
vinoyang commented on FLINK-8871:
---------------------------------

[~carp84] Please watch your wording; I think you are close to attacking me.

Who was this issue previously assigned to? Why do you think I did not give advice? [Is this not a suggestion|https://issues.apache.org/jira/browse/FLINK-8871?focusedCommentId=16790118&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16790118]? Should I have been more straightforward at the time and said that it does not make sense to submit a PR now? It would not be merged before the prerequisite work is completed, and the huge code changes can no longer be made on the basis of the original.

I don't think it makes sense to compare when the issues were created. We should compare the solutions that each of us first proposed under the two issues. If you go by creation time, then the people most entitled to bring up a PR would be sihua zhou and stefan. The solution for FLINK-10966 is the same, although its title is not clear enough. Besides, the issue was still in a normal state when we discussed it.

Regarding FLINK-12482, you obviously do not understand many of the details. This is why I do not recommend that you participate in the discussion. Can you let [~yunta] express his own opinion? Stephan asked us to wait for FLINK-12477, and FLINK-12482 is a subtask of it. The way to implement notifyCheckpointAbort is very similar to the way to implement notifyComplete. If notifyComplete is going to be refactored, what is the point of implementing notifyCheckpointAbort now?

When I said it was "meaningless", I was referring to the point in time when we first talked. Considering the huge changes in the prerequisite work, the PR obviously will not be merged before that work is done. So what is the value? It is also related to the introduction of the actor mode, under which the implementation is completely different.
I mentioned CheckpointFailureManager because I expect it to process a checkpoint exception, and if it chooses to tolerate the failure, then it should also be responsible for the cleanup. This PR has been approved, but the design I mentioned is still being worked out. Why do you think they are unrelated?

> Checkpoint cancellation is not propagated to stop checkpointing threads on the task manager
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-8871
>                 URL: https://issues.apache.org/jira/browse/FLINK-8871
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>   Affects Versions: 1.3.2, 1.4.1, 1.5.0, 1.6.0, 1.7.0
>            Reporter: Stefan Richter
>            Assignee: vinoyang
>            Priority: Critical
>
> Flink currently lacks any form of feedback mechanism from the job manager / checkpoint coordinator to the tasks when it comes to failing a checkpoint. This means that running snapshots on the tasks are not stopped even if their owning checkpoint is already cancelled. Two examples of cases where this applies are checkpoint timeouts and local checkpoint failures on a task together with a configuration that does not fail tasks on checkpoint failure. Notice that those running snapshots no longer count toward the maximum number of parallel checkpoints, because their owning checkpoint is considered cancelled.
> Not stopping the task's snapshot thread can lead to a problematic situation where the next checkpoints have already started while the abandoned checkpoint thread from a previous checkpoint is still lingering around, running. This scenario can potentially cascade: many parallel checkpoints will slow down checkpointing and make timeouts even more likely.
>
> A possible solution is introducing a {{cancelCheckpoint}} method as counterpart to the {{triggerCheckpoint}} method in the task manager gateway, which is invoked by the checkpoint coordinator as part of cancelling the checkpoint.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)