[jira] [Commented] (FLINK-8871) Checkpoint cancellation is not propagated to stop checkpointing threads on the task manager

vinoyang (JIRA) Wed, 22 May 2019 06:04:31 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845852#comment-16845852
 ]


vinoyang commented on FLINK-8871:
---------------------------------

[~carp84] In what way is my behavior similar to yours? What vulgar vocabulary 
did I use in communicating with you? If you point it out, I will humbly change 
it. Can you point it out? I don't care whether you put it in quotes or not; I 
never use such vocabulary with others in the community. This is a kind of 
slander, no doubt!

I asked him to wait, but he could also have chosen to assign the issue to 
himself; whether he held back out of respect is just your guess. I have said 
that I did not stop him from working on this issue. I just didn't explain my 
action, and I have admitted that. Does this need to be raised to a moral 
level? Please take a look at your own comments from beginning to end. I try to 
focus on explaining and commenting on the problem itself, while you always 
comment on me in a roundabout way. Is this the way you participate in the 
community?

If the suggestion to wait were my own conclusion, you could speculate about me 
however you like and I would have no objection, but I have stated that it was 
a suggestion from Stephan. Otherwise, I might already have submitted a PR 
before Yun's comment (you should know it was November 2018).

Why does this issue's PR not need to wait? Do you think Stephan's suggestion 
is unfounded? Please also take a look at [Yun Tang's own 
comment|https://issues.apache.org/jira/browse/FLINK-8871?focusedCommentId=16789755&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16789755]:
 these two APIs are similar. If one of them needs to be refactored because of 
FLINK-12477, does the other, based on the existing implementation, not need to 
be as well? I really don't know how to explain it.

> Checkpoint cancellation is not propagated to stop checkpointing threads on 
> the task manager
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-8871
>                 URL: https://issues.apache.org/jira/browse/FLINK-8871
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.3.2, 1.4.1, 1.5.0, 1.6.0, 1.7.0
>            Reporter: Stefan Richter
>            Priority: Critical
>
> Flink currently lacks any form of feedback mechanism from the job manager / 
> checkpoint coordinator to the tasks when it comes to failing a checkpoint. 
> This means that running snapshots on the tasks are also not stopped even if 
> their owning checkpoint is already cancelled. Two examples for cases where 
> this applies are checkpoint timeouts and local checkpoint failures on a task 
> together with a configuration that does not fail tasks on checkpoint failure. 
> Note that those running snapshots no longer count toward the maximum 
> number of parallel checkpoints, because their owning checkpoint is 
> considered cancelled.
> Not stopping the task's snapshot thread can lead to a problematic situation 
> where the next checkpoint has already started while the abandoned snapshot 
> thread from a previous checkpoint is still running. This scenario can 
> potentially cascade: many parallel checkpoints will slow down checkpointing 
> and make timeouts even more likely.
>  
> A possible solution is introducing a {{cancelCheckpoint}} method as a 
> counterpart to the {{triggerCheckpoint}} method in the task manager gateway, 
> which is invoked by the checkpoint coordinator as part of cancelling the 
> checkpoint.
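
The proposal in the description can be illustrated with a minimal, 
self-contained sketch. It is not Flink's actual implementation or gateway API: 
the class SnapshotRegistry and the method runningSnapshots are hypothetical 
names, and only the method names triggerCheckpoint and cancelCheckpoint are 
taken from the description. The sketch assumes that each snapshot runs as a 
task on an executor and that the snapshot work reacts to thread interruption.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

/** Hypothetical task-side registry of running snapshots (illustration only). */
public class SnapshotRegistry {
    private final ExecutorService snapshotExecutor = Executors.newCachedThreadPool();
    private final Map<Long, Future<?>> running = new ConcurrentHashMap<>();

    /** Starts the snapshot work for a checkpoint and remembers its handle. */
    public void triggerCheckpoint(long checkpointId, Runnable snapshotWork) {
        FutureTask<Void> task = new FutureTask<>(() -> {
            try {
                snapshotWork.run();
            } finally {
                // Drop the handle once the snapshot ends, normally or not.
                running.remove(checkpointId);
            }
        }, null);
        // Register before executing so a fast task cannot leave a stale entry.
        running.put(checkpointId, task);
        snapshotExecutor.execute(task);
    }

    /**
     * Counterpart to triggerCheckpoint: interrupts the snapshot thread of a
     * cancelled checkpoint. Returns false if no such snapshot is running.
     */
    public boolean cancelCheckpoint(long checkpointId) {
        Future<?> handle = running.remove(checkpointId);
        return handle != null && handle.cancel(true);
    }

    /** Number of snapshots that are still registered as running. */
    public int runningSnapshots() {
        return running.size();
    }

    /** Stops the executor and interrupts any remaining snapshot threads. */
    public void shutdown() {
        snapshotExecutor.shutdownNow();
    }
}
```

In this sketch the coordinator side would call cancelCheckpoint with the id of 
a timed-out or failed checkpoint, so that the abandoned snapshot thread stops 
instead of competing with the next checkpoint. Cancellation relies on 
interruption, so snapshot work that ignores interrupts would keep running.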



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
