Hi Manju,

could you share the full logs or at least the full stack trace of the exception with us?

I suspect that after a failover Flink tries to restore the JobGraph from persistent storage (the directory which you have configured via `high-availability.storageDir`) but is not able to do so. One reason could be that the JobGraph file has been removed by a third party, for example. I think the cause of the FlinkException could shed light on it.

Could you verify that the JobGraph file is still accessible?
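Roughly along these lines (only a sketch, not exact commands for your environment: I am assuming an HDFS-backed storageDir and the default cluster-id here, so please adjust the paths to your setup):

  # See which HA storage directory is configured
  # (the hdfs:///flink/ha value below is just an example)
  grep 'high-availability' conf/flink-conf.yaml

  # List the recovered job graph files in that directory; if I remember correctly
  # they live under <storageDir>/<cluster-id>/ and their names start with
  # "submittedJobGraph". For a local or NFS-mounted directory a plain `ls -l`
  # does the same job.
  hdfs dfs -ls hdfs:///flink/ha/default

If the submittedJobGraph file that the ZooKeeper entry points to does not show up in that listing, this would explain the FlinkException you are seeing. I have also put a second sketch below the quoted thread for checking the corresponding entry on the ZooKeeper side.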
Cheers,
Till

On Wed, May 8, 2019 at 11:22 AM Manjusha Vuyyuru <vmanjusha....@gmail.com> wrote:

> Any update on this from the community side?
>
> On Tue, May 7, 2019 at 6:43 PM Manjusha Vuyyuru <vmanjusha....@gmail.com> wrote:
>
>> I'm using 1.7.2.
>>
>> On Tue, May 7, 2019 at 5:50 PM miki haiat <miko5...@gmail.com> wrote:
>>
>>> Which Flink version are you using?
>>> I had similar issues with 1.5.x.
>>>
>>> On Tue, May 7, 2019 at 2:49 PM Manjusha Vuyyuru <vmanjusha....@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I have a Flink setup with two job managers coordinated by ZooKeeper.
>>>>
>>>> I see the below exception and both job managers are going down:
>>>>
>>>> 2019-05-07 08:29:13,346 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Released locks of job graph f8eb1b482d8ec8c1d3e94c4d0f79df77 from ZooKeeper.
>>>> 2019-05-07 08:29:13,346 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - *Fatal error occurred in the cluster entrypoint.*
>>>> java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>>>>     at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
>>>>     at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:74)
>>>>     at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
>>>>     at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>>>>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>>>>     at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
>>>>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
>>>>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>> Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>>>>     at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
>>>>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
>>>>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
>>>>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
>>>>     at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
>>>>     at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
>>>>     ... 9 more
>>>>
>>>> Can someone please help me understand in detail what is causing this exception? I can see that ZooKeeper is not able to retrieve the job graph. What could be the reason for this?
>>>>
>>>> This is the second time that my setup has gone down with this exception. The first time I cleared the jobgraph folder in ZooKeeper and restarted; now I am facing the same issue again.
>>>>
>>>> Since this is a production setup, an outage like this is not at all expected :(. Can someone help me find a permanent fix for this issue?
>>>>
>>>> Thanks,
>>>> Manju
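P.S. If you also want to inspect the ZooKeeper side, something along these lines should show you the entry the Dispatcher fails on (again only a sketch; the exact path depends on `high-availability.zookeeper.path.root` and `high-availability.cluster-id`, for which I am assuming the defaults /flink and /default, and <zk-quorum> is a placeholder for your quorum address):

  # Connect with the ZooKeeper CLI that ships with your ZooKeeper installation
  bin/zkCli.sh -server <zk-quorum>

  # Inside the CLI: list all job graph entries Flink has registered
  ls /flink/default/jobgraphs

  # Dump the entry from your log; it does not contain the JobGraph itself,
  # only a serialized pointer to a file in high-availability.storageDir
  get /flink/default/jobgraphs/147dd022ec91f7381ad4ca3d290387e9

If the znode is still there but the file it points to is gone, then removing the stale entry (as you already did once) only cures the symptom; we would still need to find out what deleted the file.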