[jira] [Comment Edited] (FLINK-8871) Checkpoint cancellation is not propagated to stop checkpointing threads on the task manager

Yu Li (JIRA) Tue, 21 May 2019 23:52:46 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845585#comment-16845585
 ]


Yu Li edited comment on FLINK-8871 at 5/22/19 6:51 AM:
-------------------------------------------------------

bq. His solution (also proposed by us) is now meaningless, I just gave my 
advice. If it makes sense, I have already raised it. I just made a friendly 
suggestion and don’t want to waste his time.
Sorry, but I cannot find any such "advice" from you in the JIRA comment 
history, nor any proposal from you in this JIRA at all. I also cannot 
tell from your comments what exactly your stance is now. Let's recall Yun's 
proposal: "introducing {{notifyCheckpointAbort}} and mechanism to cancel 
checkpoints in {{StreamTask}}". Do you think this proposal is meaningless? If 
so, why, and what is your proposal?

bq. I am involved in related work, they will more directly affect the solution 
and design of this issue; I have already claimed two other issues that are 
almost identical to it, and I expect them to be consistent.
I checked FLINK-10966: it was created much later than this issue and is 
obviously a different topic ("Optimize the release blocking logic in 
BarrierBuffer"), although {{notifyCheckpointFailed}} is mentioned in the 
discussion. I also checked FLINK-12058, which is obviously a duplicate of this 
issue, as Yun pointed out there. FLINK-12482 is unrelated to the issue reported 
here, and I cannot see where it is mentioned as a blocker. FLINK-12364, which 
aims at introducing a {{CheckpointFailureManager}}, is also unrelated to the 
proposal here. To summarize, all the JIRAs you mentioned are either a duplicate 
of this one (FLINK-12058), cover only part of the solution Yun proposed here 
(FLINK-10966), or are unrelated (FLINK-12364/12482). More importantly, none of 
them is completed, yet you conclude that the proposal here is "meaningless"?

I can see clearly that you're involved in many tasks, and that's good, but it 
doesn't give you the privilege to grab others' already-in-progress work. 
You're welcome to offer suggestions, and a helping hand would be appreciated, 
but please don't link unrelated issues together as an excuse for a "robbery".


> Checkpoint cancellation is not propagated to stop checkpointing threads on 
> the task manager
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-8871
>                 URL: https://issues.apache.org/jira/browse/FLINK-8871
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.3.2, 1.4.1, 1.5.0, 1.6.0, 1.7.0
>            Reporter: Stefan Richter
>            Assignee: vinoyang
>            Priority: Critical
>
> Flink currently lacks any form of feedback mechanism from the job manager / 
> checkpoint coordinator to the tasks when it comes to failing a checkpoint. 
> This means that running snapshots on the tasks are also not stopped even if 
> their owning checkpoint is already cancelled. Two examples for cases where 
> this applies are checkpoint timeouts and local checkpoint failures on a task 
> together with a configuration that does not fail tasks on checkpoint failure. 
> Note that those running snapshots no longer count toward the maximum 
> number of parallel checkpoints, because their owning checkpoint is considered 
> cancelled.
> Not stopping the task's snapshot thread can lead to a problematic situation 
> where the next checkpoint has already started while the abandoned snapshot 
> thread from a previous checkpoint is still running. This 
> scenario can potentially cascade: many parallel checkpoints slow down 
> checkpointing and make timeouts even more likely.
>  
> A possible solution is introducing a {{cancelCheckpoint}} method as 
> counterpart to the {{triggerCheckpoint}} method in the task manager gateway, 
> which is invoked by the checkpoint coordinator as part of cancelling the 
> checkpoint.
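
To make the proposed feedback path concrete, here is a minimal sketch of how a 
task could track its running snapshots so that a cancel call can stop them. It 
is an illustration only, not Flink's implementation: the class 
SnapshotRegistry, its method signatures, and the use of a plain thread pool are 
all assumptions, and only the names triggerCheckpoint and cancelCheckpoint come 
from the issue description above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;

/** Hypothetical task-side registry of running snapshots; not a Flink class. */
final class SnapshotRegistry {
    private final ExecutorService snapshotThreads = Executors.newCachedThreadPool();
    private final Map<Long, FutureTask<Void>> running = new ConcurrentHashMap<>();

    /** Starts the asynchronous snapshot of a checkpoint and keeps a handle to it. */
    void triggerCheckpoint(long checkpointId, Runnable snapshot) {
        FutureTask<Void> task = new FutureTask<>(() -> {
            try {
                snapshot.run();
            } finally {
                // Forget the handle once the snapshot ends, however it ends.
                running.remove(checkpointId);
            }
        }, null);
        // Register before executing so that a fast snapshot cannot leave a stale entry.
        running.put(checkpointId, task);
        snapshotThreads.execute(task);
    }

    /**
     * Counterpart to triggerCheckpoint: interrupts the snapshot thread of the
     * given checkpoint. Returns false if no such snapshot is still running.
     */
    boolean cancelCheckpoint(long checkpointId) {
        FutureTask<Void> task = running.remove(checkpointId);
        return task != null && task.cancel(true);
    }

    /** Number of snapshots that have been triggered and not yet finished or cancelled. */
    int runningSnapshots() {
        return running.size();
    }

    void shutdown() {
        snapshotThreads.shutdownNow();
    }

    public static void main(String[] args) {
        SnapshotRegistry registry = new SnapshotRegistry();
        // Stand-in for a slow snapshot that would outlive its checkpoint timeout.
        registry.triggerCheckpoint(1L, () -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println("cancelled: " + registry.cancelCheckpoint(1L));
        registry.shutdown();
    }
}
```

In this sketch the coordinator side would call cancelCheckpoint with the ID of 
the checkpoint it has just declared failed or timed out. Cancellation relies on 
thread interruption, so it only helps if the snapshot code itself reacts to 
interrupts.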



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
