[jira] [Commented] (FLINK-8871) Checkpoint cancellation is not propagated to stop checkpointing threads on the task manager

Yu Li (JIRA) Tue, 21 May 2019 21:37:15 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845507#comment-16845507
 ]


Yu Li commented on FLINK-8871:
------------------------------

bq. Yun Tang Your solution sounds good, but it would wait for other things to 
be done, more details: discussion under FLINK-10966.
Please answer my question directly: why did you ask Yun to wait and then 
assign the JIRA to yourself? Do you *really* think this is the correct thing 
to do? If so, where is your credit? [~yanghua]

bq. How do you know that other people's work has not changed? Let him express 
his opinion.
I believe now you know the answer and won't repeat.

bq. Why do you think we have not done the corresponding work for this purpose?
If you really don't know why I mentioned that we have run it in production for 
a long time, let me clarify: we are well prepared to upstream our work and the 
patch has long been ready. Since you asked us to wait, we showed our respect, 
but we never imagined anyone could behave like this. This is definitely a 
"lesson" and we will "learn" from it.

That's that, and I have no interest in being involved in this quarrel any 
further. To avoid polluting the JIRA history, we won't reassign it again until 
the work is done. My advice is to watch your behavior, boy, and try to learn 
how to work with others with respect in the open source world (with my HBase 
PMC hat on) [~yanghua].

[~yunta] please prepare and submit the PR, and let's get it in ASAP.

> Checkpoint cancellation is not propagated to stop checkpointing threads on 
> the task manager
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-8871
>                 URL: https://issues.apache.org/jira/browse/FLINK-8871
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.3.2, 1.4.1, 1.5.0, 1.6.0, 1.7.0
>            Reporter: Stefan Richter
>            Assignee: vinoyang
>            Priority: Critical
>
> Flink currently lacks any form of feedback mechanism from the job manager / 
> checkpoint coordinator to the tasks when it comes to failing a checkpoint. 
> This means that running snapshots on the tasks are also not stopped even if 
> their owning checkpoint is already cancelled. Two examples for cases where 
> this applies are checkpoint timeouts and local checkpoint failures on a task 
> together with a configuration that does not fail tasks on checkpoint failure. 
> Note that those running snapshots no longer count toward the maximum 
> number of parallel checkpoints, because their owning checkpoint is considered 
> cancelled.
> Not stopping the task's snapshot thread can lead to a problematic situation 
> where the next checkpoint has already started while the abandoned snapshot 
> thread from a previous checkpoint is still running. This scenario can 
> potentially cascade: many parallel checkpoints will slow down checkpointing 
> and make timeouts even more likely.
>  
> A possible solution is to introduce a {{cancelCheckpoint}} method as a 
> counterpart to the {{triggerCheckpoint}} method in the task manager gateway, 
> which the checkpoint coordinator invokes as part of cancelling the 
> checkpoint.
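
To illustrate the {{cancelCheckpoint}} idea from the issue description, here is a minimal, self-contained Java sketch. It is not Flink code: the TaskGateway interface, the Task class and all method signatures are hypothetical stand-ins, and the "snapshot" is just a long sleep. It only shows the general shape, where the task tracks running snapshots per checkpoint id so that a cancel call from the coordinator can interrupt the matching thread.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; none of these names are Flink's actual API. */
public class CheckpointCancelSketch {

    /** Assumed task-side gateway with a cancel counterpart to trigger. */
    interface TaskGateway {
        void triggerCheckpoint(long checkpointId);
        boolean cancelCheckpoint(long checkpointId);
    }

    static class Task implements TaskGateway {
        private final ExecutorService snapshotPool = Executors.newCachedThreadPool();
        // Running snapshots keyed by checkpoint id, so a cancel can find them.
        // Finished snapshots are not pruned here, to keep the sketch short.
        private final Map<Long, Future<?>> running = new ConcurrentHashMap<>();

        @Override
        public void triggerCheckpoint(long checkpointId) {
            Future<?> snapshot = snapshotPool.submit(() -> {
                try {
                    Thread.sleep(60_000); // stand-in for slow snapshot I/O
                } catch (InterruptedException e) {
                    // Cancelled: restore the flag and abandon the snapshot.
                    Thread.currentThread().interrupt();
                }
            });
            running.put(checkpointId, snapshot);
        }

        @Override
        public boolean cancelCheckpoint(long checkpointId) {
            Future<?> snapshot = running.remove(checkpointId);
            // cancel(true) interrupts the worker if the snapshot already started.
            return snapshot != null && snapshot.cancel(true);
        }

        boolean isRunning(long checkpointId) {
            Future<?> snapshot = running.get(checkpointId);
            return snapshot != null && !snapshot.isDone();
        }

        /** Shuts the pool down; true if all snapshot threads ended in time. */
        boolean shutdownAndAwait(long timeoutMillis) {
            snapshotPool.shutdown();
            try {
                return snapshotPool.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        Task task = new Task();
        task.triggerCheckpoint(1L);
        // The coordinator declares checkpoint 1 expired and notifies the task.
        boolean cancelled = task.cancelCheckpoint(1L);
        System.out.println("cancelled=" + cancelled
                + ", stillRunning=" + task.isRunning(1L));
        System.out.println("threadsStopped=" + task.shutdownAndAwait(5_000));
    }
}
```

In this sketch the cancel call only helps if the snapshot code responds to interruption. A real snapshot doing blocking I/O would probably need its own cooperative cancellation hook, which is outside the scope of this illustration.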



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
