Gary Yao created FLINK-9788:
-------------------------------

             Summary: ExecutionGraph Inconsistency prevents Job from recovering
                 Key: FLINK-9788
                 URL: https://issues.apache.org/jira/browse/FLINK-9788
             Project: Flink
          Issue Type: Bug
          Components: Core
    Affects Versions: 1.6.0
         Environment: Rev: 4a06160
                      Hadoop 2.8.3
            Reporter: Gary Yao
         Attachments: jobmanager_5000.log

Deployment mode: YARN job mode with HA

After killing many TaskManagers in succession, the ExecutionGraph entered an inconsistent state, which prevented the job from recovering. The following log messages and stack trace were written to the JobManager log several hundred times per second:

{noformat}
-08 16:47:18,855 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Job General purpose test job (37a794195840700b98feb23e99f7ea24) switched from state RESTARTING to RESTARTING.
2018-07-08 16:47:18,856 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Restarting the job General purpose test job (37a794195840700b98feb23e99f7ea24).
2018-07-08 16:47:18,857 DEBUG org.apache.flink.runtime.executiongraph.ExecutionGraph - Resetting execution vertex Source: Custom Source -> Timestamps/Watermarks (1/10) for new execution.
2018-07-08 16:47:18,857 WARN  org.apache.flink.runtime.executiongraph.ExecutionGraph - Failed to restart the job.
java.lang.IllegalStateException: Cannot reset a vertex that is in non-terminal state CREATED
	at org.apache.flink.runtime.executiongraph.ExecutionVertex.resetForNewExecution(ExecutionVertex.java:610)
	at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.resetForNewExecution(ExecutionJobVertex.java:573)
	at org.apache.flink.runtime.executiongraph.ExecutionGraph.restart(ExecutionGraph.java:1251)
	at org.apache.flink.runtime.executiongraph.restart.ExecutionGraphRestartCallback.triggerFullRecovery(ExecutionGraphRestartCallback.java:59)
	at org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy$1.run(FixedDelayRestartStrategy.java:68)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{noformat}

The resulting JobManager log file was 4.7 GB in size. The first 5000 lines of the log file are attached.
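
To illustrate why the restart can never succeed once a vertex is stuck in CREATED, here is a minimal, self-contained sketch. It is not Flink's code: the class {{VertexResetSketch}}, the reduced {{State}} enum, and {{countFailedRestarts}} are hypothetical names for illustration. Only the exception message is taken from the log above, and the terminal-state guard is an assumption inferred from that message.

```java
/** Simplified model of the failing reset; NOT Flink's actual implementation. */
final class VertexResetSketch {

    /** Hypothetical, reduced set of execution states (assumption for illustration). */
    enum State {
        CREATED, RUNNING, FINISHED, CANCELED, FAILED;

        /** Assumed here: only these three states count as terminal. */
        boolean isTerminal() {
            return this == FINISHED || this == CANCELED || this == FAILED;
        }
    }

    /** Guard inferred from the exception message: only terminal vertices may be reset. */
    static State resetForNewExecution(State current) {
        if (!current.isTerminal()) {
            throw new IllegalStateException(
                    "Cannot reset a vertex that is in non-terminal state " + current);
        }
        return State.CREATED;
    }

    /**
     * Retries the reset without changing the vertex state between attempts
     * and returns how many attempts failed.
     */
    static int countFailedRestarts(State stuck, int attempts) {
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                resetForNewExecution(stuck);
            } catch (IllegalStateException e) {
                failures++; // same state, same exception, on every attempt
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // A terminal vertex can be reset.
        System.out.println(resetForNewExecution(State.FAILED)); // prints CREATED
        // A vertex stuck in CREATED fails every attempt.
        System.out.println(countFailedRestarts(State.CREATED, 5)); // prints 5
    }
}
```

In this model nothing between attempts moves the vertex out of CREATED, so every retry fails the same way, which is consistent with the identical WARN entries repeating in the log.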