[ 
https://issues.apache.org/jira/browse/FLINK-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Li updated FLINK-8871:
-------------------------
    Issue Type: Improvement  (was: Bug)

Merged into master via:
* f0ed29c06d331892a06ee9bddea4173d6300835d
* fae6a6cad3f4fe30d80cc9dd664b2efc72c0be36
* 1f6b1e6b8ffb38f196c2a1c6348cb96935e9a9cc

release-1.11 via:
* fa731288518c8ebf66f40b4e0e9b1929546b6257
* fcacc42e17f00cb47c5c16fe75af035f784ae1fa
* 6591263c8871898f602d122aac336f2ad63bbb1c

Thanks [~pnowojski] for merging this.

> Checkpoint cancellation is not propagated to stop checkpointing threads on 
> the task manager
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-8871
>                 URL: https://issues.apache.org/jira/browse/FLINK-8871
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.3.2, 1.4.1, 1.5.0, 1.6.0, 1.7.0
>            Reporter: Stefan Richter
>            Assignee: Yun Tang
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.11.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Flink currently lacks any form of feedback mechanism from the job manager / 
> checkpoint coordinator to the tasks when it comes to failing a checkpoint. 
> This means that running snapshots on the tasks are also not stopped even if 
> their owning checkpoint is already cancelled. Two examples for cases where 
> this applies are checkpoint timeouts and local checkpoint failures on a task 
> together with a configuration that does not fail tasks on checkpoint failure. 
> Notice that those running snapshots do no longer account for the maximum 
> number of parallel checkpoints, because their owning checkpoint is considered 
> as cancelled.
> Not stopping the task's snapshot thread can lead to a problematic situation 
> where the next checkpoints already started, while the abandoned checkpoint 
> thread from a previous checkpoint is still lingering around running. This 
> scenario can potentially cascade: many parallel checkpoints will slow down 
> checkpointing and make timeouts even more likely.
>  
> A possible solution is introducing a {{cancelCheckpoint}} method  as 
> counterpart to the {{triggerCheckpoint}} method in the task manager gateway, 
> which is invoked by the checkpoint coordinator as part of cancelling the 
> checkpoint.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to