[ https://issues.apache.org/jira/browse/FLINK-21707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298608#comment-17298608 ]
Zhu Zhu commented on FLINK-21707: --------------------------------- cc [~trohrmann][~mapohl] > Job is possible to hang when restarting a FINISHED task with POINTWISE > BLOCKING consumers > ----------------------------------------------------------------------------------------- > > Key: FLINK-21707 > URL: https://issues.apache.org/jira/browse/FLINK-21707 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination > Affects Versions: 1.11.3, 1.12.2, 1.13.0 > Reporter: Zhu Zhu > Priority: Blocker > > Job is possible to hang when restarting a FINISHED task with POINTWISE > BLOCKING consumers. This is because > {{PipelinedRegionSchedulingStrategy#onExecutionStateChange()}} will try to > schedule all the consumer tasks/regions of the finished *ExecutionJobVertex*, > even though the regions are not the exact consumers of the finished > *ExecutionVertex*. In this case, some of the regions can be in state other > than CREATED because they are not connected to and affected by the restarted > tasks. However, {{PipelinedRegionSchedulingStrategy#maybeScheduleRegion()}} > does not allow to schedule a non-CREATED region and will throw an Exception > and breaks the scheduling of all the other regions. One example to show this > problem case can be found at > [PipelinedRegionSchedulingITCase#testRecoverFromPartitionException > |https://github.com/zhuzhurk/flink/commit/1eb036b6566c5cb4958d9957ba84dc78ce62a08c]. > To fix the problem, we can add a filter in > {{PipelinedRegionSchedulingStrategy#onExecutionStateChange()}} to only > trigger the scheduling of regions in CREATED state. -- This message was sent by Atlassian Jira (v8.3.4#803005)