Hello,

I am running a Flink 1.6.1 standalone HA cluster. Today I am unable to start the cluster because of "Fatal error in cluster entrypoint". I used to see this error when running Flink 1.5; after upgrading to 1.6.1 (which had a fix for this bug), everything worked well for a while.
Question: what exactly needs to be done to clean "state handle store"?

2018-11-15 15:09:53,181 DEBUG org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - Fencing token not set: Ignoring message LocalFencedMessage(null, org.apache.flink.runtime.rpc.messages.RunAsync@21fd224c) because the fencing token is null.
2018-11-15 15:09:53,182 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /e13034f83a80072204facb2cec9ea6a3. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
    at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
    at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$1(FunctionUtils.java:61)
    at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
    at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /e13034f83a80072204facb2cec9ea6a3. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
    at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
    at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:692)
    at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:677)
    at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:658)
    at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:817)
    at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$1(FunctionUtils.java:59)
    ... 9 more
Caused by: java.io.FileNotFoundException: /checkpoint_repo/ha/submittedJobGraphdd865937d674 (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
    at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:142)
    at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68)
    at org.apache.flink.runtime.state.RetrievableStreamStateHandle.openInputStream(RetrievableStreamStateHandle.java:64)
    at org.apache.flink.runtime.state.RetrievableStreamStateHandle.retrieveState(RetrievableStreamStateHandle.java:57)
    at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:202)
    ... 14 more
2018-11-15 15:09:53,185 INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache

Thank you,
Olga