[ https://issues.apache.org/jira/browse/FLINK-23871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aitozi updated FLINK-23871:
---------------------------
    Description: 
An exception thrown while running a recovered job triggers a fatal error in the dispatcher; this behavior was introduced in https://issues.apache.org/jira/browse/FLINK-9097. If a job has reached a finished status but the process crashes during the clean-up phase or any other post-processing phase, recovery may pick up a job whose status is RunningJobsRegistry.JobSchedulingStatus.DONE, which can cause the dispatcher to fail fatally again.

I think we should handle RunningJobsRegistry.JobSchedulingStatus.DONE with a dedicated exception such as JobFinishingException, which indicates that the job/master crashed during the job finishing phase, and only perform the clean-up work for this exception.

  was:
The exception during run recovery job will trigger fatal error which is introduced in https://issues.apache.org/jira/browse/FLINK-9097. But if a job have reached a finished status. But crash at cleap up phase or any other post phase. When recover job, it may recover a job in RunningJobsRegistry.JobSchedulingStatus.DONE status, this may lead to the dispatcher fatal again.

I think we should deal with the RunningJobsRegistry.JobSchedulingStatus.DONE with special exception like JobFinishingException, which represents the job/master crashed in job finishing phase. And only do the clean up work for this exception


> Dispatcher should handle finishing job exception when recover
> -------------------------------------------------------------
>
>                 Key: FLINK-23871
>                 URL: https://issues.apache.org/jira/browse/FLINK-23871
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.13.2
>            Reporter: Aitozi
>            Priority: Major
>
> An exception thrown while running a recovered job triggers a fatal error in
> the dispatcher; this behavior was introduced in
> https://issues.apache.org/jira/browse/FLINK-9097. If a job has reached a
> finished status but the process crashes during the clean-up phase or any
> other post-processing phase, recovery may pick up a job whose status is
> RunningJobsRegistry.JobSchedulingStatus.DONE, which can cause the dispatcher
> to fail fatally again.
> I think we should handle RunningJobsRegistry.JobSchedulingStatus.DONE with a
> dedicated exception such as JobFinishingException, which indicates that the
> job/master crashed during the job finishing phase, and only perform the
> clean-up work for this exception.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)