Hi all, I am getting a similar exception while upgrading from Flink 1.4 to 1.6:
```
06 Feb 2019 14:37:34,080 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /689f43070c701826e19ac24841050ea1. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
	at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
	at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:74)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /689f43070c701826e19ac24841050ea1. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
	at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
	at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
	at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
	at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
	at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
	at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
	... 9 more
Caused by: java.io.InvalidClassException: org.apache.flink.runtime.jobgraph.tasks.CheckpointCoordinatorConfiguration; local class incompatible: stream classdesc serialVersionUID = -647384516034982626, local class serialVersionUID = 2
```

Is it safe to clean the ZooKeeper state, as the log suggests? What information would I lose?

Thank you,
Alexander

On Fri, Nov 16, 2018 at 7:46 PM Olga Luganska <trebl...@hotmail.com> wrote:

> Hi, Miki
>
> Thank you for the reply!
>
> I have deleted the ZooKeeper data and was able to restart the cluster.
>
> Olga
>
> Sent from my iPhone
>
> On Nov 16, 2018, at 4:38 AM, miki haiat <miko5...@gmail.com> wrote:
>
> I "solved" this issue by cleaning the ZooKeeper information and starting the
> cluster again. All the checkpoint and job graph data will be erased, and
> basically you will start a new cluster...
>
> It happened to me a lot on 1.5.x.
> On 1.6 things are running perfectly.
> I'm not sure why this error is back again on 1.6.1.
>
>
> On Fri, 16 Nov 2018, 0:42 Olga Luganska <trebl...@hotmail.com> wrote:
>
>> Hello,
>>
>> I am running a Flink 1.6.1 standalone HA cluster.
>> Today I am unable to start the cluster because of "Fatal error in cluster entrypoint".
>> (I used to see this error when running Flink 1.5; after the upgrade to 1.6.1,
>> which had a fix for this bug, everything worked well for a while.)
>>
>> Question: what exactly needs to be done to clean the "state handle store"?
>>
>> 2018-11-15 15:09:53,181 DEBUG org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - Fencing token not set: Ignoring message LocalFencedMessage(null, org.apache.flink.runtime.rpc.messages.RunAsync@21fd224c) because the fencing token is null.
>>
>> 2018-11-15 15:09:53,182 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
>>
>> java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /e13034f83a80072204facb2cec9ea6a3. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>>     at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
>>     at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$1(FunctionUtils.java:61)
>>     at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
>>     at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>>     at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
>>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
>>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>> Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /e13034f83a80072204facb2cec9ea6a3. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>>     at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
>>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:692)
>>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:677)
>>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:658)
>>     at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:817)
>>     at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$1(FunctionUtils.java:59)
>>     ... 9 more
>> Caused by: java.io.FileNotFoundException: /checkpoint_repo/ha/submittedJobGraphdd865937d674 (No such file or directory)
>>     at java.io.FileInputStream.open0(Native Method)
>>     at java.io.FileInputStream.open(FileInputStream.java:195)
>>     at java.io.FileInputStream.<init>(FileInputStream.java:138)
>>     at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
>>     at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:142)
>>     at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68)
>>     at org.apache.flink.runtime.state.RetrievableStreamStateHandle.openInputStream(RetrievableStreamStateHandle.java:64)
>>     at org.apache.flink.runtime.state.RetrievableStreamStateHandle.retrieveState(RetrievableStreamStateHandle.java:57)
>>     at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:202)
>>     ... 14 more
>>
>> 2018-11-15 15:09:53,185 INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
>>
>> thank you,
>>
>> Olga
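
A note for anyone comparing the two traces in this thread: both fail with the same top-level FlinkException, but the innermost "Caused by" differs (InvalidClassException in the 2019 trace, FileNotFoundException in the 2018 one). The following is a minimal Python sketch, not part of Flink, that pulls the state handle node name and the innermost cause out of a pasted log excerpt. The function name `triage` and both regular expressions are assumptions based only on the message format quoted above.

```python
import re

# Assumed pattern: the node name as it appears in the FlinkException message above.
HANDLE_RE = re.compile(r"from state handle under\s+(/[0-9a-f]+)")
# Each "Caused by: <exception class>" entry; the last one is the innermost cause.
CAUSE_RE = re.compile(r"Caused by:\s+([\w.$]+)")


def triage(log_text):
    """Return (state_handle_node, innermost_cause_class) for a log excerpt.

    Either element is None when the excerpt does not contain it.
    """
    # Mail clients wrap long lines and prefix them with '>' quote markers;
    # drop the markers and collapse whitespace so the patterns still match.
    flat = " ".join(log_text.replace(">", " ").split())
    handle = HANDLE_RE.search(flat)
    causes = CAUSE_RE.findall(flat)
    return (handle.group(1) if handle else None,
            causes[-1] if causes else None)
```

Calling `triage()` on each trace separates the two situations. Reading the traces themselves: the InvalidClassException reports a serialVersionUID mismatch for CheckpointCoordinatorConfiguration between the stored job graph and the running classes, while the FileNotFoundException reports that the file the state handle refers to under /checkpoint_repo/ha is no longer there.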