[jira] [Commented] (FLINK-8871) Checkpoint cancellation is not propagated to stop checkpointing threads on the task manager

vinoyang (JIRA) Tue, 21 May 2019 04:02:12 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844724#comment-16844724
 ]


vinoyang commented on FLINK-8871:
---------------------------------

[~carp84] Why do you think we have not done the corresponding work for this 
purpose? Otherwise, why do you think I had to create FLINK-10966? Is it not 
solved internally? The problem is that the community keeps developing and the 
code keeps changing, so many designs are obsolete and must be redesigned 
against the latest code. I am thinking of transferring the entire error 
handling logic to the CheckpointFailureManager, which did not exist before.

I admit that, as a matter of courtesy, I should have asked first. But of these 
three issues, it does not necessarily have to be this one that stays open; I 
just want to make sure they are unified.

In addition, please do not make choices for others. If he wants to, he will 
discuss it with me, and we may even have more in-depth discussions. But if you 
are not familiar with the progress of this module, please do not act on behalf 
of others. How do you know that other people's work has not changed? 
Let him express his opinion.

> Checkpoint cancellation is not propagated to stop checkpointing threads on 
> the task manager
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-8871
>                 URL: https://issues.apache.org/jira/browse/FLINK-8871
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.3.2, 1.4.1, 1.5.0, 1.6.0, 1.7.0
>            Reporter: Stefan Richter
>            Assignee: vinoyang
>            Priority: Critical
>
> Flink currently lacks any form of feedback mechanism from the job manager / 
> checkpoint coordinator to the tasks when it comes to failing a checkpoint. 
> This means that running snapshots on the tasks are also not stopped even if 
> their owning checkpoint is already cancelled. Two examples for cases where 
> this applies are checkpoint timeouts and local checkpoint failures on a task 
> together with a configuration that does not fail tasks on checkpoint failure. 
> Notice that those running snapshots no longer count toward the maximum 
> number of parallel checkpoints, because their owning checkpoint is considered 
> cancelled.
> Not stopping the task's snapshot thread can lead to a problematic situation 
> where the next checkpoints have already started while the abandoned snapshot 
> thread from a previous checkpoint is still running. This 
> scenario can potentially cascade: many parallel checkpoints will slow down 
> checkpointing and make timeouts even more likely.
>  
> A possible solution is introducing a {{cancelCheckpoint}} method as 
> counterpart to the {{triggerCheckpoint}} method in the task manager gateway, 
> which is invoked by the checkpoint coordinator as part of cancelling the 
> checkpoint.
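
To make the proposal in the description concrete, here is a minimal sketch of 
a cancelCheckpoint call that interrupts the snapshot thread started by 
triggerCheckpoint. It is only an illustration under assumptions: 
CheckpointCancelSketch, TaskSnapshotRunner and the method signatures are 
hypothetical stand-ins, not Flink's actual task manager gateway or 
CheckpointFailureManager, and the snapshot is simulated with a sleep.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; none of these types are Flink's actual classes.
class CheckpointCancelSketch {

    /** Stand-in for the task-side end of the task manager gateway. */
    static class TaskSnapshotRunner {
        private final ExecutorService snapshotPool = Executors.newCachedThreadPool();
        // Snapshots by checkpoint id; finished entries are not pruned, for brevity.
        private final Map<Long, Future<?>> running = new ConcurrentHashMap<>();

        /** Starts the snapshot for a checkpoint on a background thread. */
        void triggerCheckpoint(long checkpointId, Runnable snapshot) {
            running.put(checkpointId, snapshotPool.submit(snapshot));
        }

        /**
         * Counterpart to triggerCheckpoint: interrupts the snapshot thread.
         * Returns false if the checkpoint is unknown or already finished.
         */
        boolean cancelCheckpoint(long checkpointId) {
            Future<?> snapshot = running.remove(checkpointId);
            return snapshot != null && snapshot.cancel(true);
        }

        void shutdown() {
            snapshotPool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        TaskSnapshotRunner task = new TaskSnapshotRunner();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch interrupted = new CountDownLatch(1);

        task.triggerCheckpoint(42L, () -> {
            started.countDown();
            try {
                Thread.sleep(60_000); // simulates a slow snapshot
            } catch (InterruptedException e) {
                interrupted.countDown(); // cancellation reached the thread
            }
        });

        started.await();
        // A coordinator would call this when the checkpoint times out.
        boolean cancelled = task.cancelCheckpoint(42L);
        boolean stopped = interrupted.await(5, TimeUnit.SECONDS);
        System.out.println("cancelled=" + cancelled + ", snapshot thread stopped=" + stopped);
        task.shutdown();
    }
}
```

The sketch relies on the snapshot code reacting to thread interruption; a 
snapshot that blocks in non-interruptible I/O would need an additional 
cooperative cancellation flag.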



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
