[ https://issues.apache.org/jira/browse/FLINK-21707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298880#comment-17298880 ]
Till Rohrmann commented on FLINK-21707:
---------------------------------------

Whether to schedule stage-wise or not could be a concern of the scheduler and not of the underlying {{IntermediateResult}} and {{IntermediateResultPartition}}. Hence, I would be in favour of removing this logic from {{IntermediateResult}}/{{IntermediateResultPartition}}.

Maybe we can create a {{FatallyFailingSchedulerNGWrapper}} which does the failing if an exception occurs. That way it would work for all {{SchedulerNG}} implementations.

> Job is possible to hang when restarting a FINISHED task with POINTWISE BLOCKING consumers
> -----------------------------------------------------------------------------------------
>
>                 Key: FLINK-21707
>                 URL: https://issues.apache.org/jira/browse/FLINK-21707
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.3, 1.12.2, 1.13.0
>            Reporter: Zhu Zhu
>            Priority: Blocker
>
> A job can hang when restarting a FINISHED task with POINTWISE BLOCKING
> consumers. This is because
> {{PipelinedRegionSchedulingStrategy#onExecutionStateChange()}} tries to
> schedule all the consumer tasks/regions of the finished *ExecutionJobVertex*,
> even if those regions are not actual consumers of the finished
> *ExecutionVertex*. In this case, some of the regions can be in a non-CREATED
> state because they are neither connected to nor affected by the restarted
> tasks. However, {{PipelinedRegionSchedulingStrategy#maybeScheduleRegion()}}
> does not allow scheduling a non-CREATED region. It throws an exception, which
> breaks the scheduling of all the other regions. An example that reproduces
> the problem can be found at
> [PipelinedRegionSchedulingITCase#testRecoverFromPartitionException|https://github.com/zhuzhurk/flink/commit/1eb036b6566c5cb4958d9957ba84dc78ce62a08c].
> To fix the problem, we can add a filter in
> {{PipelinedRegionSchedulingStrategy#onExecutionStateChange()}} so that it
> only triggers the scheduling of regions in CREATED state.
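
For illustration, the filtering idea from the issue description might look roughly like the sketch below. This is not Flink code: {{Region}}, {{RegionState}} and {{RegionFilterSketch}} are hypothetical stand-ins, and the strict {{maybeScheduleRegion}} here only imitates the "throw on non-CREATED" behaviour described in the issue, leaving out everything else the real method does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch; none of these types are Flink's actual classes.
final class RegionFilterSketch {

    // Imitates the strict behaviour from the issue: a non-CREATED region is rejected
    // with an exception, which would abort scheduling of the remaining regions.
    static void maybeScheduleRegion(Region region, List<String> scheduled) {
        if (region.state != RegionState.CREATED) {
            throw new IllegalStateException(
                    "Region " + region.name + " is in state " + region.state);
        }
        region.state = RegionState.SCHEDULED;
        scheduled.add(region.name);
    }

    // Proposed fix, simplified: keep only CREATED regions before trying to schedule.
    static List<String> scheduleCreatedRegions(List<Region> candidates) {
        List<Region> schedulable = candidates.stream()
                .filter(r -> r.state == RegionState.CREATED)
                .collect(Collectors.toList());
        List<String> scheduled = new ArrayList<>();
        for (Region r : schedulable) {
            maybeScheduleRegion(r, scheduled);
        }
        return scheduled;
    }

    public static void main(String[] args) {
        // r2 stands for a region that was not affected by the restart.
        List<Region> consumers = Arrays.asList(
                new Region("r1", RegionState.CREATED),
                new Region("r2", RegionState.FINISHED),
                new Region("r3", RegionState.CREATED));
        System.out.println(scheduleCreatedRegions(consumers));
    }
}

// Hypothetical region with a mutable scheduling state.
final class Region {
    final String name;
    RegionState state;

    Region(String name, RegionState state) {
        this.name = name;
        this.state = state;
    }
}

// Hypothetical, reduced set of states for the sketch.
enum RegionState { CREATED, SCHEDULED, FINISHED }
```

With the filter in place, the unaffected region is skipped instead of causing an exception, so the remaining CREATED regions still get scheduled.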