[ 
https://issues.apache.org/jira/browse/FLINK-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848149#comment-16848149
 ] 

Yun Tang commented on FLINK-8871:
---------------------------------

Thanks for [~srichter]'s comment. I agree we should fix FLINK-11662 first. 
Doing so would benefit the issues [~yanghua] planned to track (FLINK-10724, 
FLINK-12209), not to mention that FLINK-11662 was tagged as the only [known 
issue|https://flink.apache.org/news/2019/04/09/release-1.8.0.html#known-issues] 
when Flink 1.8 was released.
[~StephanEwen]'s 
[comment|https://issues.apache.org/jira/browse/FLINK-10930?focusedCommentId=16712017&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16712017]
 on FLINK-10930 should already provide valuable information for fixing 
FLINK-11662. IMO, [~yanghua]'s 
[PR-8322|https://github.com/apache/flink/pull/8322] would also benefit if we 
fix FLINK-11662 first.

As for our team's progress: we have already implemented a checkpoint 
cancellation mechanism based on a notification message, and a cleanup mechanism 
on the JM side that periodically scans the file system and deletes useless 
files. Judging from these issues, [~yanghua] might focus on refactoring the 
checkpoint failure procedure, while we mainly focus on implementing checkpoint 
cancellation and the useless-checkpoint cleaner.
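
To illustrate the cleanup idea only, here is a minimal sketch of a single 
cleanup pass. It is not our implementation and not Flink API: the class 
{{OrphanedFileCleaner}}, the method {{cleanOnce}} and the flat directory 
layout are hypothetical, and it assumes the set of file names still 
referenced by retained checkpoints is known.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;

// Hypothetical sketch: names and directory layout are assumptions, not Flink API.
final class OrphanedFileCleaner {

    /**
     * Deletes every regular file directly under checkpointDir whose name is
     * not in referencedFileNames. Returns the number of deleted files.
     */
    static int cleanOnce(Path checkpointDir, Set<String> referencedFileNames)
            throws IOException {
        int deleted = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(checkpointDir)) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                if (Files.isRegularFile(file) && !referencedFileNames.contains(name)) {
                    Files.delete(file);
                    deleted++;
                }
            }
        }
        return deleted;
    }
}
```

A periodic variant could run such a pass from a 
{{ScheduledExecutorService}} via {{scheduleWithFixedDelay}}. A real cleaner 
would also have to avoid deleting files of checkpoints that are still in 
progress.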



> Checkpoint cancellation is not propagated to stop checkpointing threads on 
> the task manager
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-8871
>                 URL: https://issues.apache.org/jira/browse/FLINK-8871
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.3.2, 1.4.1, 1.5.0, 1.6.0, 1.7.0
>            Reporter: Stefan Richter
>            Priority: Critical
>
> Flink currently lacks any form of feedback mechanism from the job manager / 
> checkpoint coordinator to the tasks when it comes to failing a checkpoint. 
> This means that running snapshots on the tasks are also not stopped even if 
> their owning checkpoint is already cancelled. Two examples for cases where 
> this applies are checkpoint timeouts and local checkpoint failures on a task 
> together with a configuration that does not fail tasks on checkpoint failure. 
> Note that those running snapshots no longer count toward the maximum 
> number of parallel checkpoints, because their owning checkpoint is considered 
> cancelled.
> Not stopping the task's snapshot thread can lead to a problematic situation 
> where the next checkpoint has already started while the abandoned snapshot 
> thread from a previous checkpoint is still running. This 
> scenario can potentially cascade: many parallel checkpoints will slow down 
> checkpointing and make timeouts even more likely.
>  
> A possible solution is to introduce a {{cancelCheckpoint}} method as a 
> counterpart to the {{triggerCheckpoint}} method in the task manager gateway, 
> which is invoked by the checkpoint coordinator as part of cancelling the 
> checkpoint.
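
The solution proposed in the description above could be sketched on the 
task side roughly as follows. This is a minimal illustration under stated 
assumptions, not Flink's actual gateway API: the class {{SnapshotRegistry}} 
and its method signatures are hypothetical, and a snapshot is modelled as a 
plain {{Runnable}} that reacts to thread interruption.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

// Hypothetical sketch: none of these names are taken from Flink.
final class SnapshotRegistry {

    private final ExecutorService snapshotExecutor = Executors.newCachedThreadPool();

    // Snapshots that are currently scheduled or running, keyed by checkpoint id.
    private final Map<Long, Future<?>> running = new ConcurrentHashMap<>();

    /** Starts the asynchronous snapshot for the given checkpoint. */
    void triggerCheckpoint(long checkpointId, Runnable snapshot) {
        FutureTask<Void> task = new FutureTask<>(() -> {
            try {
                snapshot.run();
            } finally {
                // Drop the entry once the snapshot ends, normally or not.
                running.remove(checkpointId);
            }
        }, null);
        // Register before executing so the cleanup above cannot run first.
        running.put(checkpointId, task);
        snapshotExecutor.execute(task);
    }

    /**
     * Called when the coordinator declares the checkpoint cancelled.
     * Interrupts the snapshot thread; returns true if a pending or
     * running snapshot was cancelled.
     */
    boolean cancelCheckpoint(long checkpointId) {
        Future<?> snapshot = running.remove(checkpointId);
        return snapshot != null && snapshot.cancel(true);
    }

    int runningSnapshots() {
        return running.size();
    }

    void shutdown() {
        snapshotExecutor.shutdownNow();
    }
}
```

With such a registry, a checkpoint timeout on the coordinator would result 
in a {{cancelCheckpoint(id)}} call on each task, so the abandoned snapshot 
thread stops instead of competing with the next checkpoint. Interruption 
only helps if the snapshot code actually responds to it.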



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
