[ https://issues.apache.org/jira/browse/FLINK-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863760#comment-16863760 ]
Piotr Nowojski edited comment on FLINK-8871 at 6/14/19 7:08 AM: ---------------------------------------------------------------- Thanks for bringing this to my attention [~yunta]. Unfortunately there are some more conflicts in the pipeline. Probably nothing that is completely incompatible but we are working on the same classes. In order to finalize [~sunhaibotb]'s `TwoInputStreamSelectable`, I have to rework `CheckpointBarrierHandler` implementation quite a bit. I two biggest sources of conflicts: # I'm splitting the existing code of {{BarrierBuffer}} into two separate classes - I'm moving the code that actually performs the alignment to a separate class {{CheckpointBarrierAligner}}, leaving {{BarrierBuffer}} a quite thin class. This is in order to support two separate instances of {{BarrierBuffer}} for {{TwoInputStreamOperator}} - that means the code tracking alignment, blocking channels etc must be extracted two another class, that will be shared by those two {{BarrierBufffer}} instances. # I'm planning to deduplicate the code by a bit and completely remove {{BarrierTracker}} class and unify it with {{BarrierBuffer}}. You can even see this problem in your PR, where you are implementing handling of {{private final NavigableSet<Long> abortedCheckpointIds;}} twice in those two classes. was (Author: pnowojski): Thanks for bringing this to my attention [~yunta]. Unfortunately there are some more conflicts in the pipeline. Probably nothing that is completely incompatible but we are working on the same classes. In order to finalize [~sunhaibotb]'s `TwoInputStreamSelectable`, I have to rework `CheckpointBarrierHandler` implementation quite a bit. I two biggest sources of conflicts: 1. I'm splitting the existing code of {{BarrierBuffer}} into two separate classes - I'm moving the code that actually performs the alignment to a separate class {{CheckpointBarrierAligner}}, leaving {{BarrierBuffer}} a quite thin class. This is in order to support two separate instances of {{BarrierBuffer}} for {{TwoInputStreamOperator}} - that means the code tracking alignment, blocking channels etc must be extracted two another class, that will be shared by those two {{BarrierBufffer}} instances. 2. I'm planning to deduplicate the code by a bit and completely remove {{BarrierTracker}} class and unify it with {{BarrierBuffer}}. You can even see this problem in your PR, where you are implementing handling of {{private final NavigableSet<Long> abortedCheckpointIds;}} twice in those two classes. > Checkpoint cancellation is not propagated to stop checkpointing threads on > the task manager > ------------------------------------------------------------------------------------------- > > Key: FLINK-8871 > URL: https://issues.apache.org/jira/browse/FLINK-8871 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing > Affects Versions: 1.3.2, 1.4.1, 1.5.0, 1.6.0, 1.7.0 > Reporter: Stefan Richter > Assignee: Yun Tang > Priority: Critical > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Flink currently lacks any form of feedback mechanism from the job manager / > checkpoint coordinator to the tasks when it comes to failing a checkpoint. > This means that running snapshots on the tasks are also not stopped even if > their owning checkpoint is already cancelled. Two examples for cases where > this applies are checkpoint timeouts and local checkpoint failures on a task > together with a configuration that does not fail tasks on checkpoint failure. > Notice that those running snapshots do no longer account for the maximum > number of parallel checkpoints, because their owning checkpoint is considered > as cancelled. > Not stopping the task's snapshot thread can lead to a problematic situation > where the next checkpoints already started, while the abandoned checkpoint > thread from a previous checkpoint is still lingering around running. This > scenario can potentially cascade: many parallel checkpoints will slow down > checkpointing and make timeouts even more likely. > > A possible solution is introducing a {{cancelCheckpoint}} method as > counterpart to the {{triggerCheckpoint}} method in the task manager gateway, > which is invoked by the checkpoint coordinator as part of cancelling the > checkpoint. -- This message was sent by Atlassian JIRA (v7.6.3#76005)