[jira] [Commented] (FLINK-21707) Job is possible to hang when restarting a FINISHED task with POINTWISE BLOCKING consumers

Zhu Zhu (Jira) Wed, 10 Mar 2021 20:00:09 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-21707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299293#comment-17299293
 ]


Zhu Zhu commented on FLINK-21707:
---------------------------------

I agree with "removing this logic from the
IntermediateResult/IntermediateResultPartition". However, given that the
problem in this ticket also exists in 1.11/1.12, how about we first apply a
simple fix for this ticket, as in the initial proposal, and open a separate
1.13 ticket to change the step-wise scheduling logic?

Regarding {{FatallyFailingSchedulerNGWrapper}}, I can see that exceptions are
currently acceptable for some of the RPCs, e.g. {{requestNextInputSplit()}}
and {{requestPartitionState()}}, and I am not confident that I can identify
all RPCs of this kind at the moment. If we always fail the job on exceptions,
Flink might become less stable than it is now. All I can say for now is that
direct exceptions from {{updateTaskExecutionState()}} are not expected. So how
about we just add a try-catch for {{updateTaskExecutionState()}} in JobMaster?
It would also work for all SchedulerNG implementations. I would also prefer
this to be a separate task for 1.13 only, to give it more exposure time.
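
A rough sketch of what such a try-catch could look like. The types below
({{Scheduler}}, {{JobMasterLike}}) are simplified, hypothetical stand-ins and
not Flink's actual JobMaster or SchedulerNG signatures. Forwarding the
exception to a fatal-error callback is an assumption made for illustration;
what the catch block should actually do is still open.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-ins for illustration only; not Flink's real classes.
class GuardedUpdateSketch {

    /** Minimal stand-in for the one scheduler method that is guarded here. */
    interface Scheduler {
        boolean updateTaskExecutionState(String taskExecutionState);
    }

    /** Stand-in for the JobMaster side of the RPC. */
    static final class JobMasterLike {
        private final Scheduler scheduler;
        private final Consumer<Throwable> fatalErrorHandler; // assumed callback

        JobMasterLike(Scheduler scheduler, Consumer<Throwable> fatalErrorHandler) {
            this.scheduler = scheduler;
            this.fatalErrorHandler = fatalErrorHandler;
        }

        boolean updateTaskExecutionState(String taskExecutionState) {
            try {
                return scheduler.updateTaskExecutionState(taskExecutionState);
            } catch (Throwable t) {
                // Direct exceptions from this call are not expected, so they are
                // reported instead of being silently returned to the RPC caller.
                fatalErrorHandler.accept(t);
                return false;
            }
        }
    }

    public static void main(String[] args) {
        List<Throwable> reported = new ArrayList<>();
        Scheduler failing = state -> {
            throw new IllegalStateException("unexpected scheduler failure");
        };
        JobMasterLike jobMaster = new JobMasterLike(failing, reported::add);

        boolean accepted = jobMaster.updateTaskExecutionState("FINISHED");
        System.out.println("accepted=" + accepted + ", reported=" + reported.size());
    }
}
```

Because the guard wraps the call site rather than the scheduler, it does not
depend on which scheduler implementation is plugged in, and other RPCs keep
their current exception behavior.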

> Job is possible to hang when restarting a FINISHED task with POINTWISE 
> BLOCKING consumers
> -----------------------------------------------------------------------------------------
>
>                 Key: FLINK-21707
>                 URL: https://issues.apache.org/jira/browse/FLINK-21707
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.3, 1.12.2, 1.13.0
>            Reporter: Zhu Zhu
>            Priority: Blocker
>
> A job can hang when restarting a FINISHED task with POINTWISE BLOCKING
> consumers. This is because
> {{PipelinedRegionSchedulingStrategy#onExecutionStateChange()}} tries to
> schedule all the consumer tasks/regions of the finished *ExecutionJobVertex*,
> even if those regions are not actual consumers of the finished
> *ExecutionVertex*. In this case, some of the regions can be in a non-CREATED
> state because they are neither connected to nor affected by the restarted
> tasks. However, {{PipelinedRegionSchedulingStrategy#maybeScheduleRegion()}}
> does not allow scheduling a non-CREATED region; it throws an exception, which
> breaks the scheduling of all the other regions. An example that demonstrates
> the problem can be found at
> [PipelinedRegionSchedulingITCase#testRecoverFromPartitionException
> |https://github.com/zhuzhurk/flink/commit/1eb036b6566c5cb4958d9957ba84dc78ce62a08c].
> To fix the problem, we can add a filter in
> {{PipelinedRegionSchedulingStrategy#onExecutionStateChange()}} so that it
> only triggers the scheduling of regions in CREATED state.
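
The failure mode and the proposed filter can be sketched as follows. This is
a toy model under stated assumptions: {{Region}}, {{State}}, {{scheduleAll()}}
and {{scheduleCreatedOnly()}} are hypothetical names, not Flink's actual
classes or methods; only the strict check in {{maybeScheduleRegion()}} mirrors
the behavior described in the ticket.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model for illustration; not Flink's real classes.
class RegionFilterSketch {

    enum State { CREATED, SCHEDULED, RUNNING, FINISHED }

    static final class Region {
        final String name;
        State state;

        Region(String name, State state) {
            this.name = name;
            this.state = state;
        }
    }

    /** Strict check as described in the ticket: non-CREATED regions are rejected. */
    static void maybeScheduleRegion(Region region, List<String> scheduled) {
        if (region.state != State.CREATED) {
            throw new IllegalStateException(
                    "Region " + region.name + " is in state " + region.state);
        }
        region.state = State.SCHEDULED;
        scheduled.add(region.name);
    }

    /** Without a filter, the first non-CREATED region aborts the whole loop. */
    static List<String> scheduleAll(List<Region> consumers) {
        List<String> scheduled = new ArrayList<>();
        for (Region region : consumers) {
            maybeScheduleRegion(region, scheduled);
        }
        return scheduled;
    }

    /** With the proposed filter, only regions in CREATED state are considered. */
    static List<String> scheduleCreatedOnly(List<Region> consumers) {
        List<String> scheduled = new ArrayList<>();
        consumers.stream()
                .filter(region -> region.state == State.CREATED)
                .forEach(region -> maybeScheduleRegion(region, scheduled));
        return scheduled;
    }

    public static void main(String[] args) {
        List<Region> consumers = new ArrayList<>();
        consumers.add(new Region("region-A", State.RUNNING)); // unaffected by the restart
        consumers.add(new Region("region-B", State.CREATED)); // reset by the restart
        System.out.println(scheduleCreatedOnly(consumers)); // prints [region-B]
    }
}
```

In this model, calling {{scheduleAll()}} on the same input throws on region-A
and never reaches region-B, which is the hang described above.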



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
