Hi Manju,

could you share the full logs or at least the full stack trace of the exception with us?

I suspect that after a failover Flink tries to restore the JobGraph from persistent storage (the directory which you have configured via `high-availability.storageDir`) but is not able to do so. One reason could be that the JobGraph file has been removed by a third party, for example. I think the cause of the FlinkException could shed light on it.

Could you verify that the JobGraph file is still accessible?
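Roughly along these lines (only a sketch, not exact commands for your environment: I am assuming an HDFS-backed storageDir and the default cluster-id here, so please adjust the paths to your setup):

  # See which HA storage directory is configured
  # (the hdfs:///flink/ha value below is just an example)
  grep 'high-availability' conf/flink-conf.yaml

  # List the recovered job graph files in that directory; if I remember correctly
  # they live under <storageDir>/<cluster-id>/ and their names start with
  # "submittedJobGraph". For a local or NFS-mounted directory a plain `ls -l`
  # does the same job.
  hdfs dfs -ls hdfs:///flink/ha/default

If the submittedJobGraph file that the ZooKeeper entry points to does not show up in that listing, this would explain the FlinkException you are seeing. I have also put a second sketch below the quoted thread for checking the corresponding entry on the ZooKeeper side.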
Cheers,
Till

On Wed, May 8, 2019 at 11:22 AM Manjusha Vuyyuru <vmanjusha....@gmail.com> wrote:

> Any update on this from the community side?
>
> On Tue, May 7, 2019 at 6:43 PM Manjusha Vuyyuru <vmanjusha....@gmail.com> wrote:
>
>> I'm using 1.7.2.
>>
>> On Tue, May 7, 2019 at 5:50 PM miki haiat <miko5...@gmail.com> wrote:
>>
>>> Which Flink version are you using?
>>> I had similar issues with 1.5.x.
>>>
>>> On Tue, May 7, 2019 at 2:49 PM Manjusha Vuyyuru <vmanjusha....@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I have a Flink setup with two job managers coordinated by ZooKeeper.
>>>>
>>>> I see the below exception and both job managers are going down:
>>>>
>>>> 2019-05-07 08:29:13,346 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Released locks of job graph f8eb1b482d8ec8c1d3e94c4d0f79df77 from ZooKeeper.
>>>> 2019-05-07 08:29:13,346 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - *Fatal error occurred in the cluster entrypoint.*
>>>> java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>>>>     at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
>>>>     at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:74)
>>>>     at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
>>>>     at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>>>>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>>>>     at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
>>>>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
>>>>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>> Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>>>>     at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
>>>>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
>>>>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
>>>>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
>>>>     at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
>>>>     at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
>>>>     ... 9 more
>>>>
>>>> Can someone please help me understand in detail what is causing this exception? I can see that ZooKeeper is not able to retrieve the job graph. What could be the reason for this?
>>>>
>>>> This is the second time that my setup has gone down with this exception. The first time I cleared the jobgraph folder in ZooKeeper and restarted; now I am facing the same issue again.
>>>>
>>>> Since this is a production setup, an outage like this is not at all expected :(. Can someone help me find a permanent fix for this issue?
>>>>
>>>> Thanks,
>>>> Manju
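P.S. If you also want to inspect the ZooKeeper side, something along these lines should show you the entry the Dispatcher fails on (again only a sketch; the exact path depends on `high-availability.zookeeper.path.root` and `high-availability.cluster-id`, for which I am assuming the defaults /flink and /default, and <zk-quorum> is a placeholder for your quorum address):

  # Connect with the ZooKeeper CLI that ships with your ZooKeeper installation
  bin/zkCli.sh -server <zk-quorum>

  # Inside the CLI: list all job graph entries Flink has registered
  ls /flink/default/jobgraphs

  # Dump the entry from your log; it does not contain the JobGraph itself,
  # only a serialized pointer to a file in high-availability.storageDir
  get /flink/default/jobgraphs/147dd022ec91f7381ad4ca3d290387e9

If the znode is still there but the file it points to is gone, then removing the stale entry (as you already did once) only cures the symptom; we would still need to find out what deleted the file.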