Which Flink version are you using? I had similar issues with 1.5.x.

On Tue, May 7, 2019 at 2:49 PM Manjusha Vuyyuru <vmanjusha....@gmail.com> wrote:
> Hello,
>
> I have a Flink setup with two JobManagers coordinated by ZooKeeper.
>
> I see the exception below, and both JobManagers are going down:
>
> 2019-05-07 08:29:13,346 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Released locks of job graph f8eb1b482d8ec8c1d3e94c4d0f79df77 from ZooKeeper.
> 2019-05-07 08:29:13,346 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint -* Fatal error occurred in the cluster entrypoint.*
> java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>     at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
>     at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:74)
>     at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
>     at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>     at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>     at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
>     at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
>     ... 9 more
>
> Can someone please help me understand in detail what is causing this
> exception? I can see that ZooKeeper is not able to retrieve the job graph.
> What could be the reason for this?
>
> This is the second time my setup has gone down with this exception. The
> first time, I cleared the jobgraph folder in ZooKeeper and restarted; now I
> am facing the same issue again.
>
> Since this is a production setup, this kind of outage is not at all
> expected :(. Can someone help me find a permanent fix for this issue?
>
> Thanks,
> Manju