Gary Yao created FLINK-9788:
-------------------------------

             Summary: ExecutionGraph Inconsistency prevents Job from recovering
                 Key: FLINK-9788
                 URL: https://issues.apache.org/jira/browse/FLINK-9788
             Project: Flink
          Issue Type: Bug
          Components: Core
    Affects Versions: 1.6.0
         Environment: Rev: 4a06160
Hadoop 2.8.3
            Reporter: Gary Yao
         Attachments: jobmanager_5000.log

Deployment mode: YARN job mode with HA

After killing many TaskManagers in succession, the ExecutionGraph ended up in 
an inconsistent state, which prevented job recovery. The following log 
sequence, including the stacktrace, was written to the JobManager log several 
hundred times per second:
{noformat}
-08 16:47:18,855 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job General purpose test job (37a794195840700b98feb23e99f7ea24) switched from state RESTARTING to RESTARTING.
2018-07-08 16:47:18,856 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Restarting the job General purpose test job (37a794195840700b98feb23e99f7ea24).
2018-07-08 16:47:18,857 DEBUG org.apache.flink.runtime.executiongraph.ExecutionGraph        - Resetting execution vertex Source: Custom Source -> Timestamps/Watermarks (1/10) for new execution.
2018-07-08 16:47:18,857 WARN  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Failed to restart the job.
java.lang.IllegalStateException: Cannot reset a vertex that is in non-terminal state CREATED
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.resetForNewExecution(ExecutionVertex.java:610)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.resetForNewExecution(ExecutionJobVertex.java:573)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.restart(ExecutionGraph.java:1251)
        at org.apache.flink.runtime.executiongraph.restart.ExecutionGraphRestartCallback.triggerFullRecovery(ExecutionGraphRestartCallback.java:59)
        at org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy$1.run(FixedDelayRestartStrategy.java:68)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{noformat}
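
For readers unfamiliar with the message, the following is a minimal, 
self-contained sketch of the kind of precondition that produces it. It is not 
Flink's source code: the class ResetSketch, the State enum, the Vertex class 
and the tryReset helper are all hypothetical names, and the set of terminal 
states is an assumption made for illustration. It only shows that once a vertex 
has been reset to CREATED, a second reset attempt is rejected. It makes no 
claim about the actual root cause of this issue.

```java
public class ResetSketch {

    /** Simplified stand-in for an execution state (assumption, not Flink's enum). */
    enum State {
        CREATED(false), RUNNING(false), FINISHED(true), CANCELED(true), FAILED(true);

        private final boolean terminal;

        State(boolean terminal) {
            this.terminal = terminal;
        }

        boolean isTerminal() {
            return terminal;
        }
    }

    /** Hypothetical vertex that may only be reset from a terminal state. */
    static final class Vertex {
        private State state;

        Vertex(State state) {
            this.state = state;
        }

        State getState() {
            return state;
        }

        void resetForNewExecution() {
            if (!state.isTerminal()) {
                throw new IllegalStateException(
                        "Cannot reset a vertex that is in non-terminal state " + state);
            }
            // A successful reset puts the vertex back into CREATED.
            state = State.CREATED;
        }
    }

    /** Returns true if the reset succeeded, false if the precondition rejected it. */
    static boolean tryReset(Vertex vertex) {
        try {
            vertex.resetForNewExecution();
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Vertex vertex = new Vertex(State.FAILED);
        System.out.println(tryReset(vertex)); // reset from FAILED
        System.out.println(tryReset(vertex)); // reset again, now from CREATED
    }
}
```

In this sketch the first call succeeds and the second is rejected, because 
CREATED is not terminal. Retrying the same restart without changing the vertex 
state can therefore never succeed, which is consistent with the message being 
logged repeatedly.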

The resulting JobManager log file was 4.7 GB in size. The first 5000 lines of 
the log file are attached as jobmanager_5000.log.


