[ https://issues.apache.org/jira/browse/FLINK-21707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299293#comment-17299293 ]
Zhu Zhu commented on FLINK-21707:
---------------------------------

Agreed on "removing this logic from the IntermediateResult/IntermediateResultPartition". However, given that the problem in this ticket also exists in 1.11/1.12, how about we first do a simple fix for this ticket, as in the initial proposal, and open a separate 1.13 ticket to change the step-wise scheduling logic?

Regarding {{FatallyFailingSchedulerNGWrapper}}, I can see that exceptions are currently acceptable for some of the RPCs, e.g. {{requestNextInputSplit(), requestPartitionState()}}, and I am not confident that I can identify all RPCs of this kind at the moment. If we always fail the job on exceptions, Flink might become less stable than it is now. All I can say for now is that exceptions thrown directly from {{updateTaskExecutionState()}} are not expected. So how about we just add a try-catch around {{updateTaskExecutionState()}} in JobMaster? That would also work for all SchedulerNG implementations. I would also prefer this to be a separate task for 1.13 only, to give it more exposure time.

> Job is possible to hang when restarting a FINISHED task with POINTWISE
> BLOCKING consumers
> -----------------------------------------------------------------------------------------
>
>                 Key: FLINK-21707
>                 URL: https://issues.apache.org/jira/browse/FLINK-21707
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.3, 1.12.2, 1.13.0
>            Reporter: Zhu Zhu
>            Priority: Blocker
>
> A job can hang when restarting a FINISHED task with POINTWISE
> BLOCKING consumers. This is because
> {{PipelinedRegionSchedulingStrategy#onExecutionStateChange()}} will try to
> schedule all the consumer tasks/regions of the finished *ExecutionJobVertex*,
> even though some of those regions are not actual consumers of the finished
> *ExecutionVertex*. In this case, some of the regions can be in a non-CREATED
> state because they are neither connected to nor affected by the restarted tasks.
> However, {{PipelinedRegionSchedulingStrategy#maybeScheduleRegion()}} does not
> allow scheduling a non-CREATED region; it throws an Exception, which breaks
> the scheduling of all the other regions. An example that reproduces this
> problem can be found at
> [PipelinedRegionSchedulingITCase#testRecoverFromPartitionException|https://github.com/zhuzhurk/flink/commit/1eb036b6566c5cb4958d9957ba84dc78ce62a08c].
> To fix the problem, we can add a filter in
> {{PipelinedRegionSchedulingStrategy#onExecutionStateChange()}} to only
> trigger the scheduling of regions in CREATED state.
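
To make the filter proposed in the description concrete, here is a minimal, self-contained Java sketch. It is not Flink code: {{RegionFilterSketch}}, {{Region}}, {{State}} and {{scheduleConsumers()}} are hypothetical, simplified stand-ins, and the use of {{IllegalStateException}} for the "throws an Exception" behaviour is an assumption. It only illustrates the idea of skipping non-CREATED regions so that one such region no longer aborts the scheduling of the others.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-ins for illustration only; NOT Flink's real classes.
final class RegionFilterSketch {

    enum State { CREATED, RUNNING, FINISHED }

    static final class Region {
        final String name;
        final State state;

        Region(String name, State state) {
            this.name = name;
            this.state = state;
        }
    }

    // Mimics the behaviour described in the ticket: a non-CREATED region is rejected.
    // The concrete exception type is an assumption.
    static void maybeScheduleRegion(Region region, List<String> scheduled) {
        if (region.state != State.CREATED) {
            throw new IllegalStateException(
                    "Region " + region.name + " is not in CREATED state");
        }
        scheduled.add(region.name);
    }

    // The proposed fix in miniature: only regions in CREATED state are passed on,
    // so a RUNNING or FINISHED region is skipped instead of causing an exception.
    static List<String> scheduleConsumers(List<Region> consumerRegions) {
        List<String> scheduled = new ArrayList<>();
        for (Region region : consumerRegions) {
            if (region.state == State.CREATED) {
                maybeScheduleRegion(region, scheduled);
            }
        }
        return scheduled;
    }

    public static void main(String[] args) {
        List<Region> consumers = Arrays.asList(
                new Region("r1", State.RUNNING),   // unaffected by the restart
                new Region("r2", State.CREATED),
                new Region("r3", State.CREATED));
        System.out.println(scheduleConsumers(consumers)); // prints [r2, r3]
    }
}
```

Without the {{region.state == State.CREATED}} check, the loop in this sketch would fail on {{r1}} and never reach {{r2}} and {{r3}}, which corresponds to the hang described above.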