I "solved" this issue by cleaning the ZooKeeper information and starting the cluster again. All the checkpoint and job graph data will be erased, and basically you will start a new cluster...
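Before wiping everything, it may help to work out which entry the exception actually points at. Below is a minimal sketch in plain Python (not part of Flink) that pulls the job graph node and the missing file out of a JobManager log. The helper name `diagnose` is hypothetical, and the znode prefix is an assumption based on the default values of `high-availability.zookeeper.path.root`, `high-availability.cluster-id` and `high-availability.zookeeper.path.jobgraphs`; check your flink-conf.yaml, since your cluster may use different values.

```python
import re

# ASSUMPTION: default ZooKeeper HA layout; override to match flink-conf.yaml.
ZK_ROOT = "/flink"        # high-availability.zookeeper.path.root
CLUSTER_ID = "/default"   # high-availability.cluster-id
JOBGRAPHS = "/jobgraphs"  # high-availability.zookeeper.path.jobgraphs

# Patterns taken from the error text quoted in this thread.
_BROKEN_HANDLE = re.compile(
    r"Could not retrieve submitted JobGraph from state handle under\s+(/[0-9a-f]+)\."
)
_MISSING_FILE = re.compile(
    r"FileNotFoundException:\s+(\S+)\s+\(No such file or directory\)"
)


def diagnose(log_text):
    """Return (znode_path, missing_file) for the first broken job graph.

    Returns None if the log contains no such error; missing_file is None
    if no FileNotFoundException is present.
    """
    # Collapse line breaks so wrapped log lines still match.
    text = " ".join(log_text.split())
    handle = _BROKEN_HANDLE.search(text)
    if handle is None:
        return None
    missing = _MISSING_FILE.search(text)
    znode = ZK_ROOT + CLUSTER_ID + JOBGRAPHS + handle.group(1)
    return znode, (missing.group(1) if missing else None)
```

On the log quoted below, and only if the defaults above apply, this yields the znode /flink/default/jobgraphs/e13034f83a80072204facb2cec9ea6a3 and the missing file /checkpoint_repo/ha/submittedJobGraphdd865937d674. That at least shows whether the problem is a single stale job graph entry whose file is gone from the HA storage directory.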
It has happened to me a lot on 1.5.x. On 1.6 things are running perfectly. I'm not sure why this error is back again on 1.6.1?

On Fri, 16 Nov 2018, 0:42 Olga Luganska <trebl...@hotmail.com> wrote:

> Hello,
>
> I am running flink 1.6.1 standalone HA cluster. Today I am unable to start
> cluster because of "Fatal error in cluster entrypoint"
> (I used to see this error when running flink 1.5 version, after upgrade to
> 1.6.1 (which had a fix for this bug) everything worked well for a while)
>
> Question: what exactly needs to be done to clean "state handle store"?
>
> 2018-11-15 15:09:53,181 DEBUG org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - Fencing token not set: Ignoring message LocalFencedMessage(null, org.apache.flink.runtime.rpc.messages.RunAsync@21fd224c) because the fencing token is null.
>
> 2018-11-15 15:09:53,182 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
>
> java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /e13034f83a80072204facb2cec9ea6a3. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>     at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
>     at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$1(FunctionUtils.java:61)
>     at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
>     at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>     at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /e13034f83a80072204facb2cec9ea6a3. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>     at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:692)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:677)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:658)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:817)
>     at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$1(FunctionUtils.java:59)
>     ... 9 more
> Caused by: java.io.FileNotFoundException: /checkpoint_repo/ha/submittedJobGraphdd865937d674 (No such file or directory)
>     at java.io.FileInputStream.open0(Native Method)
>     at java.io.FileInputStream.open(FileInputStream.java:195)
>     at java.io.FileInputStream.<init>(FileInputStream.java:138)
>     at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
>     at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:142)
>     at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68)
>     at org.apache.flink.runtime.state.RetrievableStreamStateHandle.openInputStream(RetrievableStreamStateHandle.java:64)
>     at org.apache.flink.runtime.state.RetrievableStreamStateHandle.retrieveState(RetrievableStreamStateHandle.java:57)
>     at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:202)
>     ... 14 more
>
> 2018-11-15 15:09:53,185 INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
>
> thank you,
>
> Olga