[ https://issues.apache.org/jira/browse/FLINK-21030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17268409#comment-17268409 ]
Zhu Zhu edited comment on FLINK-21030 at 1/20/21, 7:21 AM:
-----------------------------------------------------------

Agreed to trigger a global failover to bring FINISHED tasks back to RUNNING if stop-with-savepoint fails. Maybe right after the stopped checkpoint scheduler is restarted in {{SchedulerBase#stopWithSavepoint()}}.


was (Author: zhuzh):
    Agreed to trigger a global failover to bring FINISHED tasks back to RUNNING if stop-with-savepoint fails. Maybe right after that the stopped checkpoint scheduler is restarted in {{SchedulerBase#stopWithSavepoint()}}.

> Broken job restart for job with disjoint graph
> ----------------------------------------------
>
>                 Key: FLINK-21030
>                 URL: https://issues.apache.org/jira/browse/FLINK-21030
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.2
>            Reporter: Theo Diefenthal
>            Priority: Blocker
>             Fix For: 1.13.0, 1.11.4, 1.12.2
>
>
> Building on top of bugs:
> https://issues.apache.org/jira/browse/FLINK-21028
> and https://issues.apache.org/jira/browse/FLINK-21029 :
> I tried to stop a Flink application on YARN via savepoint, which didn't
> succeed due to a possible bug/race condition in shutdown (Bug 21028). For
> some reason, Flink attempted to restart the pipeline after the failure in
> shutdown (21029). The bug here:
> As I mentioned: my job graph is disjoint and the pipelines are fully isolated.
> Let's say the original error occurred in a single task of pipeline1. Flink then
> restarted the entire pipeline1, but pipeline2 was shut down successfully and
> switched its state to FINISHED.
> My job was thus in an invalid state after the attempt to stop it:
> one of the two pipelines was running, the other was FINISHED. I guess this is
> a bug in the restarting behavior: only the connected component of the graph
> that contains the failed task is restarted, but the others aren't...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
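
To make the reported state concrete, here is a toy model of the two restart scopes discussed above. It is only a sketch and does not use Flink's scheduler classes: {{DisjointRestartSketch}}, the task names, and the {{State}} enum are all hypothetical, and the "regional" restart is simplified to "every task connected to the failed one". Under those assumptions, restarting only the failed task's connected component leaves pipeline2 FINISHED, while a global restart brings every task back to RUNNING.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy model; none of these names come from Flink itself.
public class DisjointRestartSketch {

    enum State { RUNNING, FINISHED, FAILED }

    // Undirected adjacency lists of an assumed job with two isolated pipelines.
    static Map<String, List<String>> exampleGraph() {
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("source1", List.of("sink1"));
        graph.put("sink1", List.of("source1"));
        graph.put("source2", List.of("sink2"));
        graph.put("sink2", List.of("source2"));
        return graph;
    }

    // Breadth-first search collecting every task reachable from the start task.
    static Set<String> connectedComponent(Map<String, List<String>> graph, String start) {
        Set<String> seen = new TreeSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String task = queue.poll();
            for (String neighbor : graph.getOrDefault(task, List.of())) {
                if (seen.add(neighbor)) {
                    queue.add(neighbor);
                }
            }
        }
        return seen;
    }

    // Returns a copy of the states in which the given tasks are RUNNING again.
    static Map<String, State> restart(Map<String, State> states, Set<String> tasksToRestart) {
        Map<String, State> result = new HashMap<>(states);
        for (String task : tasksToRestart) {
            result.put(task, State.RUNNING);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = exampleGraph();

        // State after the failed stop-with-savepoint, as described in the report:
        // pipeline1 has a failed task, pipeline2 already finished.
        Map<String, State> states = new HashMap<>();
        states.put("source1", State.FAILED);
        states.put("sink1", State.RUNNING);
        states.put("source2", State.FINISHED);
        states.put("sink2", State.FINISHED);

        // Restart limited to the failed task's connected component.
        Map<String, State> regional = restart(states, connectedComponent(graph, "source1"));
        // Global restart: every task in the graph.
        Map<String, State> global = restart(states, graph.keySet());

        System.out.println("regional: " + new TreeMap<>(regional));
        System.out.println("global:   " + new TreeMap<>(global));
    }
}
```

In this model the regional restart reproduces the mixed RUNNING/FINISHED state from the report, which is why a global failover after a failed stop-with-savepoint was suggested in the comment.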