[ https://issues.apache.org/jira/browse/FLINK-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845585#comment-16845585 ]
Yu Li edited comment on FLINK-8871 at 5/22/19 6:51 AM:
-------------------------------------------------------

bq. His solution (also proposed by us) is now meaningless, I just gave my advice. If it makes sense, I have already raised it. I just made a friendly suggestion and don’t want to waste his time.

Sorry, but I cannot find any such "advice" from you in the JIRA comment history, nor any proposal from you in this JIRA at all. I also cannot tell from your comments what exactly your stance is now. Let's recall Yun's proposal: "introducing {{notifyCheckpointAbort}} and mechanism to cancel checkpoints in {{StreamTask}}". Do you think this proposal is meaningless? If so, why, and what is your proposal?

bq. I am involved in related work, they will more directly affect the solution and design of this issue; I have already claimed two other issues that are almost identical to it, and I expect them to be consistent.

I checked FLINK-10966: it was created much later than this one and is obviously about another topic ("Optimize the release blocking logic in BarrierBuffer"), although {{notifyCheckpointFailed}} is mentioned in the discussion. I also checked FLINK-12058, but it is obviously a duplicate of this one, as Yun also pointed out there. FLINK-12482 is unrelated to the issue reported here, and I cannot see where it is mentioned as a blocker. FLINK-12364, which aims at introducing a {{CheckpointFailureManager}}, is also unrelated to the proposal here.

To summarize, all the JIRAs you mentioned are either duplicates of this one (FLINK-12058), cover only part of the solution proposed by Yun here (FLINK-10966), or are unrelated (FLINK-12364/12482). More importantly, none of them is completed, yet you are concluding that the proposal here is "meaningless"?

I can see clearly that you're involved in many tasks, and that's good, but it doesn't mean you have the privilege to grab others' already-in-progress work.
You're welcome to offer suggestions, and it would be appreciated if you offered a hand to help, but please do not link unrelated issues together as an excuse for a "robbery".

> Checkpoint cancellation is not propagated to stop checkpointing threads on the task manager
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-8871
>                 URL: https://issues.apache.org/jira/browse/FLINK-8871
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.3.2, 1.4.1, 1.5.0, 1.6.0, 1.7.0
>            Reporter: Stefan Richter
>            Assignee: vinoyang
>            Priority: Critical
>
> Flink currently lacks any form of feedback mechanism from the job manager / checkpoint coordinator to the tasks when it comes to failing a checkpoint. This means that running snapshots on the tasks are not stopped even if their owning checkpoint is already cancelled. Two examples of cases where this applies are checkpoint timeouts and local checkpoint failures on a task together with a configuration that does not fail tasks on checkpoint failure. Notice that those running snapshots no longer count toward the maximum number of parallel checkpoints, because their owning checkpoint is considered cancelled.
> Not stopping the task's snapshot thread can lead to a problematic situation where the next checkpoints have already started while the abandoned thread from a previous checkpoint is still running. This scenario can potentially cascade: many parallel checkpoints will slow down checkpointing and make timeouts even more likely.
>
> A possible solution is introducing a {{cancelCheckpoint}} method as counterpart to the {{triggerCheckpoint}} method in the task manager gateway, which is invoked by the checkpoint coordinator as part of cancelling the checkpoint.
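To make the idea in the issue description concrete, here is a minimal Java sketch of a task-side registry that pairs {{triggerCheckpoint}} with a {{cancelCheckpoint}} counterpart. Only those two method names come from the description; the class name {{CheckpointCancelSketch}}, the {{runningSnapshots}} map, the thread pool, and {{runDemo}} are hypothetical and are not Flink's actual classes or gateway API. It assumes the snapshot work reacts to thread interruption.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;

/** Hypothetical task-side registry of running snapshots (illustration only, not Flink code). */
public class CheckpointCancelSketch {
    private final ExecutorService snapshotPool = Executors.newCachedThreadPool();
    // Snapshots that have been triggered and not yet finished or cancelled, keyed by checkpoint id.
    private final Map<Long, FutureTask<Void>> runningSnapshots = new ConcurrentHashMap<>();

    /** Starts the asynchronous snapshot work for the given checkpoint. */
    public void triggerCheckpoint(long checkpointId, Runnable snapshotWork) {
        FutureTask<Void> task = new FutureTask<Void>(snapshotWork, null);
        runningSnapshots.put(checkpointId, task); // register before running so a cancel can find it
        snapshotPool.execute(() -> {
            task.run();
            runningSnapshots.remove(checkpointId, task); // drop the entry once the work is over
        });
    }

    /**
     * Counterpart to triggerCheckpoint: interrupts the snapshot thread of a cancelled checkpoint.
     * Returns true only if a still-running snapshot was actually cancelled.
     */
    public boolean cancelCheckpoint(long checkpointId) {
        FutureTask<Void> task = runningSnapshots.remove(checkpointId);
        return task != null && task.cancel(true);
    }

    public void shutdown() {
        snapshotPool.shutdownNow();
    }

    /** Small self-check: trigger a slow snapshot, cancel it, and verify the thread was interrupted. */
    public static boolean runDemo() {
        CheckpointCancelSketch taskSide = new CheckpointCancelSketch();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch interrupted = new CountDownLatch(1);
        try {
            taskSide.triggerCheckpoint(1L, () -> {
                started.countDown();
                try {
                    Thread.sleep(60_000); // stands in for a long-running snapshot
                } catch (InterruptedException e) {
                    interrupted.countDown(); // the snapshot thread observed the cancellation
                }
            });
            return started.await(5, TimeUnit.SECONDS)
                    && taskSide.cancelCheckpoint(1L)            // first cancel stops the snapshot
                    && interrupted.await(5, TimeUnit.SECONDS)
                    && !taskSide.cancelCheckpoint(1L);          // second cancel is a no-op
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            taskSide.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("cancellation propagated: " + runDemo());
    }
}
```

In this sketch the coordinator side would call {{cancelCheckpoint}} with the id of a timed-out or declined checkpoint, so the abandoned snapshot thread stops instead of competing with the next checkpoint. Whether interruption is a safe way to stop real snapshot work is a separate design question that the sketch does not address.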