[ 
https://issues.apache.org/jira/browse/FLINK-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845851#comment-16845851
 ] 

Till Rohrmann commented on FLINK-8871:
--------------------------------------

Sorry for not intervening earlier, but I have only just learned about this 
conversation. As a member of the Flink community, I want to state clearly that 
I disapprove of the interactions and manners displayed in this thread. The 
discussion has strayed far from where it should be, namely a technical 
discussion about the best way to fix the problem at hand. 

I want to remind all involved parties that we all have the same goal, which is 
to improve Flink, help our users, and build a welcoming and inclusive open 
source community. As established members of the Flink community, we are directly 
responsible for shaping this community and should act accordingly. Please take 
this responsibility seriously!

If I see it correctly, we haven't agreed on the design yet. Before 
discussing who will do the implementation, I suggest that we first agree on a 
concrete design and implementation plan. In my opinion there are still some 
unanswered questions, for example how the cancellation signal is propagated. 
Until then, I suggest that nobody claims the issue and that it stays unassigned.

Of course, it is unfortunate that both of you have already spent time 
implementing a potential fix, and I can understand your frustration that 
some of the work might be redundant. For the future, I suggest that we 
first discuss and agree on a concrete solution before starting to implement it. 
Moreover, having the buy-in of a committer reduces the risk of overlooking 
some aspects, and it makes the review and merge process much smoother because 
the committer will dedicate time to it.

> Checkpoint cancellation is not propagated to stop checkpointing threads on 
> the task manager
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-8871
>                 URL: https://issues.apache.org/jira/browse/FLINK-8871
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.3.2, 1.4.1, 1.5.0, 1.6.0, 1.7.0
>            Reporter: Stefan Richter
>            Assignee: vinoyang
>            Priority: Critical
>
> Flink currently lacks any form of feedback mechanism from the job manager / 
> checkpoint coordinator to the tasks when it comes to failing a checkpoint. 
> This means that running snapshots on the tasks are also not stopped even if 
> their owning checkpoint is already cancelled. Two examples for cases where 
> this applies are checkpoint timeouts and local checkpoint failures on a task 
> together with a configuration that does not fail tasks on checkpoint failure. 
> Notice that those running snapshots no longer count toward the maximum 
> number of parallel checkpoints, because their owning checkpoint is considered 
> cancelled.
> Not stopping the task's snapshot thread can lead to a problematic situation 
> where the next checkpoint has already started while the abandoned snapshot 
> thread from a previous checkpoint is still running. This 
> scenario can potentially cascade: many parallel checkpoints will slow down 
> checkpointing and make timeouts even more likely.
>  
> A possible solution is introducing a {{cancelCheckpoint}} method as a 
> counterpart to the {{triggerCheckpoint}} method in the task manager gateway, 
> which is invoked by the checkpoint coordinator as part of cancelling the 
> checkpoint.
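
The proposal quoted above can be illustrated with a minimal, self-contained 
sketch. It is not Flink's actual gateway or an agreed design: the class and 
method names below (TaskGateway, Snapshot, Task, demo) are hypothetical, and 
how the coordinator delivers the cancellation to the task is deliberately 
left out, since that question is still open. The sketch only shows the idea 
that a task-side snapshot checks a flag that a cancelCheckpoint call can flip.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative only: every name here is hypothetical, not Flink's real API.
public class CheckpointCancellationSketch {

    /** Stand-in for a task-side gateway with the proposed counterpart method. */
    interface TaskGateway {
        void triggerCheckpoint(long checkpointId);
        void cancelCheckpoint(long checkpointId);
    }

    /** One running snapshot; the flag is what a cancel call flips. */
    static final class Snapshot {
        private final AtomicBoolean cancelled = new AtomicBoolean(false);
        // Plain int is enough here because the demo is single-threaded.
        private int chunksWritten;

        /** Writes one chunk unless the owning checkpoint was cancelled. */
        boolean writeNextChunk() {
            if (cancelled.get()) {
                return false; // stop doing work for an abandoned checkpoint
            }
            chunksWritten++;
            return true;
        }

        int chunksWritten() {
            return chunksWritten;
        }
    }

    /** Tracks running snapshots by checkpoint id. */
    static final class Task implements TaskGateway {
        private final Map<Long, Snapshot> running = new ConcurrentHashMap<>();

        @Override
        public void triggerCheckpoint(long checkpointId) {
            running.put(checkpointId, new Snapshot());
        }

        @Override
        public void cancelCheckpoint(long checkpointId) {
            // Unknown or already finished checkpoints are ignored.
            Snapshot snapshot = running.remove(checkpointId);
            if (snapshot != null) {
                snapshot.cancelled.set(true);
            }
        }

        Snapshot snapshot(long checkpointId) {
            return running.get(checkpointId);
        }
    }

    /** Returns the number of chunks written before and despite cancellation. */
    static int demo() {
        Task task = new Task();
        task.triggerCheckpoint(1L);
        Snapshot snapshot = task.snapshot(1L);
        snapshot.writeNextChunk();
        snapshot.writeNextChunk();
        task.cancelCheckpoint(1L); // e.g. the coordinator hit a timeout
        snapshot.writeNextChunk(); // refused, the flag is set
        return snapshot.chunksWritten();
    }
}
```

In this sketch the snapshot polls the flag between chunks, so cancellation is 
cooperative; a real implementation might also need to interrupt blocking I/O, 
which is not shown here.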



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
