flink 1.7 HA production setup going down completely

Manjusha Vuyyuru Tue, 07 May 2019 04:50:24 -0700

Hello,

I have a Flink setup with two JobManagers coordinated by ZooKeeper.

I see the exception below, and both JobManagers are going down:

2019-05-07 08:29:13,346 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Released locks of job graph f8eb1b482d8ec8c1d3e94c4d0f79df77 from ZooKeeper.
2019-05-07 08:29:13,346 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Fatal error occurred in the cluster entrypoint.
java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
        at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
        at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:74)
        at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
        at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
        at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
        at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
        at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
        at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
        ... 9 more


Can someone please help me understand in detail what is causing this
exception? I can see that the job graph cannot be retrieved from the state
handle in ZooKeeper. What could be the reason for this?

This is the second time my setup has gone down with this exception. The
first time, I cleared the jobgraph folder in ZooKeeper and restarted; now I
am facing the same issue again.

Since this is a production setup, an outage like this is not at all expected
:(. Can someone help me find a permanent fix for this issue?


Thanks,
Manju
