[jira] [Commented] (FLINK-8871) Checkpoint cancellation is not propagated to stop checkpointing threads on the task manager

Yu Li (JIRA) Wed, 22 May 2019 05:38:31 -0700


    [ https://issues.apache.org/jira/browse/FLINK-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845835#comment-16845835 ]


Yu Li commented on FLINK-8871:
------------------------------

bq. Can the word robbery be used at will?
There's obviously a quote in my comment, but it seems you chose to ignore it. 
Fine anyway; from my point of view, your behavior is pretty much like that.

bq. Am I replacing "assignee" from Yun Tang to mine?
Yun showed enough respect and left the assignee field empty when you asked him 
to wait. Of course you could use his kindness as your weapon, if you think 
that's proper. And as I mentioned above, of course we will learn from the lesson.

bq. Unfortunately, if you understand the details, then you will not say so.
Please just point out which of my points is inaccurate, or answer my question 
directly: why do you think the change to the producer/consumer has to wait 
for the change to the message queue implementation? Do you really understand 
the whole picture?

bq.  FLINK-12477 includes the refactor of notifyComplete, but 
notifyCheckpointAbort is similar to the implementation to notifyComplete. How 
did you come to a conclusion that they have no relationship?
{{notifyCheckpointComplete}} is an already implemented method and needs to 
rebase its event-publishing/consuming logic onto the new design. 
{{notifyCheckpointAbort}} is something not implemented yet. They are clearly 
different and easy to tell apart.

bq. I can only ask you not to comment on me.
Behave right or bear the comments, I'd say.

I think I've clearly expressed my points and will stop responding. We will 
submit our PR and leave the judgement to the community. Thanks.

> Checkpoint cancellation is not propagated to stop checkpointing threads on 
> the task manager
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-8871
>                 URL: https://issues.apache.org/jira/browse/FLINK-8871
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.3.2, 1.4.1, 1.5.0, 1.6.0, 1.7.0
>            Reporter: Stefan Richter
>            Assignee: vinoyang
>            Priority: Critical
>
> Flink currently lacks any form of feedback mechanism from the job manager / 
> checkpoint coordinator to the tasks when it comes to failing a checkpoint. 
> This means that running snapshots on the tasks are also not stopped even if 
> their owning checkpoint is already cancelled. Two examples for cases where 
> this applies are checkpoint timeouts and local checkpoint failures on a task 
> together with a configuration that does not fail tasks on checkpoint failure. 
> Note that those running snapshots no longer count toward the maximum 
> number of parallel checkpoints, because their owning checkpoint is considered 
> cancelled.
> Not stopping the task's snapshot thread can lead to a problematic situation 
> where the next checkpoint has already started while the abandoned snapshot 
> thread from a previous checkpoint is still running. This 
> scenario can potentially cascade: many parallel checkpoints will slow down 
> checkpointing and make timeouts even more likely.
>  
> A possible solution is introducing a {{cancelCheckpoint}} method as a 
> counterpart to the {{triggerCheckpoint}} method in the task manager gateway, 
> which is invoked by the checkpoint coordinator as part of cancelling the 
> checkpoint.
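
The proposal quoted above could be sketched roughly as follows. This is only an illustration and not Flink's actual code: the TaskGateway interface, the SnapshotingTask class, and the helper methods are hypothetical names invented for this example, and the snapshot work is simulated with a sleep. The idea is that the task keeps a handle on each running snapshot, so that a cancelCheckpoint call from the coordinator can interrupt the snapshot thread instead of leaving it lingering.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;

final class CheckpointCancelSketch {

    /** Hypothetical gateway: cancelCheckpoint mirrors triggerCheckpoint. */
    interface TaskGateway {
        void triggerCheckpoint(long checkpointId);
        void cancelCheckpoint(long checkpointId);
    }

    /** Hypothetical task that runs each snapshot on its own thread. */
    static final class SnapshotingTask implements TaskGateway {
        private final ExecutorService snapshotExecutor = Executors.newCachedThreadPool();
        private final Map<Long, FutureTask<Void>> running = new ConcurrentHashMap<>();

        @Override
        public void triggerCheckpoint(long checkpointId) {
            FutureTask<Void> snapshot = new FutureTask<Void>(() -> {
                try {
                    writeSnapshot();
                } finally {
                    running.remove(checkpointId); // finished or interrupted
                }
                return null;
            });
            // Register before starting, so a fast snapshot cannot leave a stale entry.
            running.put(checkpointId, snapshot);
            snapshotExecutor.execute(snapshot);
        }

        @Override
        public void cancelCheckpoint(long checkpointId) {
            FutureTask<Void> snapshot = running.remove(checkpointId);
            if (snapshot != null) {
                snapshot.cancel(true); // interrupts the snapshot thread if it is running
            }
        }

        int runningSnapshots() {
            return running.size();
        }

        /** Returns true if all snapshot threads stopped within the timeout. */
        boolean shutdownAndAwait(long timeoutMillis) {
            snapshotExecutor.shutdown();
            try {
                return snapshotExecutor.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }

        /** Stand-in for slow state upload; sleep is interruptible. */
        private void writeSnapshot() throws InterruptedException {
            Thread.sleep(60_000);
        }
    }

    public static void main(String[] args) {
        SnapshotingTask task = new SnapshotingTask();
        task.triggerCheckpoint(1L);
        task.cancelCheckpoint(1L); // e.g. the coordinator timed the checkpoint out
        System.out.println("stopped in time: " + task.shutdownAndAwait(2_000));
    }
}
```

In this sketch, cancellation relies on thread interruption, so it only helps when the snapshot code reaches an interruptible point; blocking I/O that ignores interrupts would need a different mechanism, such as closing the underlying stream.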



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
