[ 
https://issues.apache.org/jira/browse/FLINK-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848170#comment-16848170
 ] 

vinoyang edited comment on FLINK-8871 at 5/25/19 12:12 PM:
-----------------------------------------------------------

Hi [~srichter] First of all, there have been some off-topic discussions on this 
issue, and I was one of the participants. I am sorry about that, and we should 
try our best not to let it happen again. Thank you for helping to sort out the 
relationships and priorities between these issues. I basically agree with you; 
there is only one point that I would like us to revisit.

I am not denying the importance of FLINK-11662, but it seems that it can 
proceed in parallel with the checkpoint failure handling work. Although 
FLINK-11662 can also cause the job to fail, it does not affect the semantics of 
the {{CheckpointFailureManager}} (after the job fails and recovers, the 
{{CheckpointFailureManager}} simply starts working again). Moreover, I think 
the three steps listed below will go a long way towards simplifying the 
checkpoint-related code, and other work will benefit from them as well.

I am currently implementing a refactoring of the checkpoint failure handling, 
so let me give more details here. Initially, it was designed to support more 
flexible failure-tolerance conditions, such as tolerating n consecutive 
checkpoint failures. Over time it grew into a larger theme and is now divided 
into three parts:
 * FLINK-10724: Refactoring of the checkpoint-related classes: no longer 
distinguish between the trigger phase and the execution phase, introduce a 
unified exception class {{CheckpointException}}, enumerate all possible failure 
causes in {{CheckpointFailureReason}}, and refactor 
{{PendingCheckpoint#abortXXX}} (see the first sketch after this list). In fact, 
when implementing [PR-8322|https://github.com/apache/flink/pull/8322] I found 
that I had also missed a number of decline-related exceptions; I hope to cover 
them in the second step;
 * FLINK-12364: Introduce the checkpoint failure manager: we simplified and 
refactored many configuration options and introduced a simple counting 
mechanism (see the second sketch after this list). There are currently two 
problems with this [PR|https://github.com/apache/flink/pull/8322]: (1) it is 
not yet compatible with {{failTaskOnCheckpointError}}, which made me realize 
that the first step's work is not completely finished; (2) the mechanism for 
failing the job may be problematic and needs further confirmation. In addition, 
I believe that more designs in the future will benefit from the 
{{CheckpointFailureManager}}, which can take on more responsibility and 
separate more logic out of the {{CheckpointCoordinator}}, as mentioned in my 
previous comments;
 * FLINK-12209: The TM is no longer allowed to explicitly fail the job based on 
the checkpoint configuration. This step is blocked by the second step, which 
will allow us to drop the {{failTaskOnCheckpointError}} configuration. From 
then on, the TM side will not be able to explicitly fail the job based on this 
configuration, which will benefit other issues.
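
To make the first step a bit more concrete, here is a minimal sketch of what a 
unified exception class with an enumerated failure reason could look like. This 
is only an illustration of the idea, not the actual classes from the PR; the 
enum constants shown are examples, not the full list.

{code:java}
// Sketch only: a unified checkpoint exception that carries an enumerated
// failure reason. Not the actual Flink code; the constants are illustrative.
enum CheckpointFailureReason {
    CHECKPOINT_EXPIRED,
    CHECKPOINT_DECLINED,
    TOO_MANY_CONCURRENT_CHECKPOINTS,
    EXCEPTION
}

class CheckpointException extends Exception {

    private final CheckpointFailureReason reason;

    CheckpointException(CheckpointFailureReason reason) {
        super(reason.name());
        this.reason = reason;
    }

    CheckpointException(CheckpointFailureReason reason, Throwable cause) {
        super(reason.name(), cause);
        this.reason = reason;
    }

    CheckpointFailureReason getCheckpointFailureReason() {
        return reason;
    }
}
{code}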
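
And for the second step, a minimal sketch of the counting idea: tolerate a 
configurable number of consecutive checkpoint failures, fail the job once that 
number is exceeded, and reset the counter on every successful checkpoint. 
Again, this is an illustration only; the callback interface and names are 
assumptions, not the code in the PR.

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: counts consecutive checkpoint failures and fails the job once a
// configurable tolerance is exceeded. Not the actual CheckpointFailureManager.
class SimpleCheckpointFailureCounter {

    /** Hypothetical callback used to fail the whole job. */
    interface FailJobCallback {
        void failJob(Throwable cause);
    }

    private final int tolerableConsecutiveFailures;
    private final FailJobCallback failJobCallback;
    private final AtomicInteger consecutiveFailures = new AtomicInteger(0);

    SimpleCheckpointFailureCounter(int tolerableConsecutiveFailures, FailJobCallback callback) {
        this.tolerableConsecutiveFailures = tolerableConsecutiveFailures;
        this.failJobCallback = callback;
    }

    /** Called when a checkpoint fails; fails the job once the tolerance is exceeded. */
    void handleCheckpointFailure(Throwable cause) {
        if (consecutiveFailures.incrementAndGet() > tolerableConsecutiveFailures) {
            failJobCallback.failJob(cause);
        }
    }

    /** Called when a checkpoint completes; resets the consecutive-failure counter. */
    void handleCheckpointSuccess() {
        consecutiveFailures.set(0);
    }
}
{code}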

 

Without a doubt, I agree with the importance of FLINK-11662. My question is 
whether we can reconsider processing them in parallel, because the refactoring 
is currently at an intermediate stage (and the second step will continue to 
address the work left over from the first step).

In fact, we have done similar work to what [~yunta]'s team has done, as 
mentioned in a previous comment. In view of some of the friction this issue has 
caused, I will not fight for this task. In addition, because of other issues I 
have no time to pay attention to FLINK-11662; maybe [~yunta] has time to deal 
with it.

I will take the time to give more details about the checkpoint cleaner as soon 
as possible. I think we all agree that FLINK-10855 will inevitably be blocked 
by other work, so its priority is not currently the highest.



> Checkpoint cancellation is not propagated to stop checkpointing threads on 
> the task manager
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-8871
>                 URL: https://issues.apache.org/jira/browse/FLINK-8871
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.3.2, 1.4.1, 1.5.0, 1.6.0, 1.7.0
>            Reporter: Stefan Richter
>            Priority: Critical
>
> Flink currently lacks any form of feedback mechanism from the job manager / 
> checkpoint coordinator to the tasks when it comes to failing a checkpoint. 
> This means that running snapshots on the tasks are also not stopped even if 
> their owning checkpoint is already cancelled. Two examples of cases where 
> this applies are checkpoint timeouts and local checkpoint failures on a task 
> together with a configuration that does not fail tasks on checkpoint failure. 
> Note that those running snapshots no longer count towards the maximum number 
> of parallel checkpoints, because their owning checkpoint is considered 
> cancelled.
> Not stopping the task's snapshot thread can lead to a problematic situation 
> where the next checkpoint has already started while the abandoned snapshot 
> thread from a previous checkpoint is still lingering around. This scenario 
> can potentially cascade: many parallel checkpoints will slow down 
> checkpointing and make timeouts even more likely.
>  
> A possible solution is introducing a {{cancelCheckpoint}} method as a 
> counterpart to the {{triggerCheckpoint}} method in the task manager gateway, 
> which is invoked by the checkpoint coordinator as part of cancelling the 
> checkpoint.
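
As a rough illustration of that proposal, the sketch below shows a simplified, 
hypothetical gateway interface with the proposed counterpart method. The method 
names follow the description above, but the parameter lists are assumptions for 
illustration and do not reflect Flink's actual task manager gateway signatures.

{code:java}
// Hypothetical, simplified gateway interface illustrating the proposed
// cancelCheckpoint counterpart to triggerCheckpoint. Parameters are assumptions.
interface CheckpointingGateway {

    /** Asks the task to start a snapshot for the given checkpoint. */
    void triggerCheckpoint(long checkpointId, long timestamp);

    /**
     * Proposed counterpart: asks the task to stop any snapshot work that still
     * belongs to the given, already cancelled checkpoint.
     */
    void cancelCheckpoint(long checkpointId);
}
{code}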



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
