Standalone HA cluster: Fatal error occurred in the cluster entrypoint.

Olga Luganska Thu, 15 Nov 2018 14:42:51 -0800

Hello,

I am running a Flink 1.6.1 standalone HA cluster. Today I am unable to start
the cluster because of a "Fatal error occurred in the cluster entrypoint" error.
(I used to see this error when running Flink 1.5; after upgrading to 1.6.1,
which had a fix for this bug, everything worked well for a while.)


Question: what exactly needs to be done to clean the "state handle store"?


2018-11-15 15:09:53,181 DEBUG org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor          - Fencing token not set: Ignoring message LocalFencedMessage(null, org.apache.flink.runtime.rpc.messages.RunAsync@21fd224c) because the fencing token is null.

2018-11-15 15:09:53,182 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Fatal error occurred in the cluster entrypoint.
java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /e13034f83a80072204facb2cec9ea6a3. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
        at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
        at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$1(FunctionUtils.java:61)
        at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
        at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
        at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /e13034f83a80072204facb2cec9ea6a3. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
        at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:692)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:677)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:658)
        at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:817)
        at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$1(FunctionUtils.java:59)
        ... 9 more
Caused by: java.io.FileNotFoundException: /checkpoint_repo/ha/submittedJobGraphdd865937d674 (No such file or directory)
        at java.io.FileInputStream.open0(Native Method)
        at java.io.FileInputStream.open(FileInputStream.java:195)
        at java.io.FileInputStream.<init>(FileInputStream.java:138)
        at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
        at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:142)
        at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68)
        at org.apache.flink.runtime.state.RetrievableStreamStateHandle.openInputStream(RetrievableStreamStateHandle.java:64)
        at org.apache.flink.runtime.state.RetrievableStreamStateHandle.retrieveState(RetrievableStreamStateHandle.java:57)
        at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:202)
        ... 14 more

2018-11-15 15:09:53,185 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Shutting down BLOB cache
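
Reading the trace, the job graph entry /e13034f83a80072204facb2cec9ea6a3 seems to reference a file under /checkpoint_repo/ha that no longer exists. To make it easier to see which entries and files are involved, here is a minimal Python sketch that pulls both out of a log like the one above. It is not part of Flink: the helper name find_broken_handles and the regular expressions are my own assumptions, based only on the message text in this log, and the script reads the log without touching ZooKeeper or the HA directory.

```python
import re

# Excerpt of the messages from the log in this mail; line wrapping is tolerated.
LOG_EXCERPT = """
Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted
JobGraph from state handle under /e13034f83a80072204facb2cec9ea6a3. This
indicates that the retrieved state handle is broken. Try cleaning the state
handle store.
Caused by: java.io.FileNotFoundException:
/checkpoint_repo/ha/submittedJobGraphdd865937d674 (No such file or directory)
"""


def find_broken_handles(log_text):
    """Return (entry_names, missing_files) mentioned in a Flink HA startup log.

    Hypothetical helper: the patterns are guesses derived from the two
    exception messages quoted in this thread, not from a Flink API.
    """
    # Collapse line wrapping so the patterns can match across line breaks.
    flat = " ".join(log_text.split())
    entries = re.findall(r"state handle under (/[0-9a-f]+)\.", flat)
    files = re.findall(
        r"FileNotFoundException: (\S+) \(No such file or directory\)", flat
    )
    # Deduplicate while keeping first-seen order (the message repeats per cause).
    return list(dict.fromkeys(entries)), list(dict.fromkeys(files))


if __name__ == "__main__":
    entries, files = find_broken_handles(LOG_EXCERPT)
    print("Entries with a broken state handle:", entries)
    print("Referenced files that are missing:", files)
```

This only identifies the entry and the missing file; what should then be removed from the store is exactly my question.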


Thank you,

Olga
