Re: JobGraphs not cleaned up in HA mode

Vijay Bhaskar Wed, 27 Nov 2019 22:43:06 -0800

Can you share the flink configuration once?

Regards
Bhaskar


On Thu, Nov 28, 2019 at 12:09 PM 曾祥才 <xcz200...@qq.com> wrote:

> if i clean the zookeeper data , it runs fine .  but next time when the
> jobmanager failed and redeploy the error occurs again
>
>
>
>
> ------------------ 原始邮件 ------------------
> *发件人:* "Vijay Bhaskar"<bhaskar.eba...@gmail.com>;
> *发送时间:* 2019年11月28日(星期四) 下午3:05
> *收件人:* "曾祥才"<xcz200...@qq.com>;
> *主题:* Re: JobGraphs not cleaned up in HA mode
>
> Again it could not find the state store file: "Caused by:
> java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199 "
>  Check why its unable to find.
> Better thing is: Clean up zookeeper state and check your configurations,
> correct them and restart cluster.
> Otherwise it always picks up corrupted state from zookeeper and it will
> never restart
>
> Regards
> Bhaskar
>
> On Thu, Nov 28, 2019 at 11:51 AM 曾祥才 <xcz200...@qq.com> wrote:
>
>> i've made a misstake( the log before is another cluster) . the full
>> exception log is :
>>
>>
>> INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      -
>> Recovering all persisted jobs.
>> 2019-11-28 02:33:12,726 INFO
>> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl  -
>> Starting the SlotManager.
>> 2019-11-28 02:33:12,743 INFO
>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
>> Released locks of job graph 639170a9d710bacfd113ca66b2aacefa from
>> ZooKeeper.
>> 2019-11-28 02:33:12,744 ERROR
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Fatal error
>> occurred in the cluster entrypoint.
>> org.apache.flink.runtime.dispatcher.DispatcherException: Failed to take
>> leadership with session id 607e1fe0-f1b4-4af0-af16-4137d779defb.
>> at
>> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$30(Dispatcher.java:915)
>> at
>> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>> at
>> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>> at
>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>> at
>> java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575)
>> at
>> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:594)
>> at
>> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
>> at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
>> at
>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
>> at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> at
>> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> at
>> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>> Caused by: java.lang.RuntimeException:
>> org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph
>> from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates
>> that the retrieved state handle is broken. Try cleaning the state handle
>> store.
>> at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
>> at
>> org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75)
>> at
>> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
>> at
>> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
>> ... 7 more
>> Caused by: org.apache.flink.util.FlinkException: Could not retrieve
>> submitted JobGraph from state handle under
>> /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state
>> handle is broken. Try cleaning the state handle store.
>> at
>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:190)
>> at
>> org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:755)
>> at
>> org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:740)
>> at
>> org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:721)
>> at
>> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$27(Dispatcher.java:888)
>> at
>> org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:73)
>> ... 9 more
>> Caused by: java.io.FileNotFoundException:
>> /flink/ha/submittedJobGraph0c6bcff01199 (No such file or directory)
>> at java.io.FileInputStream.open0(Native Method)
>> at java.io.FileInputStream.open(FileInputStream.java:195)
>>
>>
>>
>>
>> ------------------ 原始邮件 ------------------
>> *发件人:* "Vijay Bhaskar"<bhaskar.eba...@gmail.com>;
>> *发送时间:* 2019年11月28日(星期四) 下午2:46
>> *收件人:* "曾祥才"<xcz200...@qq.com>;
>> *抄送:* "User-Flink"<user@flink.apache.org>;
>> *主题:* Re: JobGraphs not cleaned up in HA mode
>>
>> Is it filesystem or hadoop? If its NAS then why the exception "Caused by:
>> org.apache.hadoop.hdfs.BlockMissingException: "
>> It seems you configured hadoop state store and giving NAS mount.
>>
>> Regards
>> Bhaskar
>>
>>
>>
>> On Thu, Nov 28, 2019 at 11:36 AM 曾祥才 <xcz200...@qq.com> wrote:
>>
>>> /flink/checkpoints  is a external persistent store (a nas directory
>>> mounts to the job manager)
>>>
>>>
>>>
>>>
>>> ------------------ 原始邮件 ------------------
>>> *发件人:* "Vijay Bhaskar"<bhaskar.eba...@gmail.com>;
>>> *发送时间:* 2019年11月28日(星期四) 下午2:29
>>> *收件人:* "曾祥才"<xcz200...@qq.com>;
>>> *抄送:* "user"<user@flink.apache.org>;
>>> *主题:* Re: JobGraphs not cleaned up in HA mode
>>>
>>> Following are the mandatory condition to run in HA:
>>>
>>> a) You should have persistent common external store for jobmanager and
>>> task managers to while writing the state
>>> b) You should have persistent external store for zookeeper to store the
>>> Jobgraph.
>>>
>>> Zookeeper is referring  path:
>>> /flink/checkpoints/submittedJobGraph480ddf9572ed  to get the job graph but
>>> jobmanager unable to find it.
>>> It seems /flink/checkpoints  is not the external persistent store
>>>
>>>
>>> Regards
>>> Bhaskar
>>>
>>> On Thu, Nov 28, 2019 at 10:43 AM seuzxc <xcz200...@qq.com> wrote:
>>>
>>>> hi ，I've the same problem with flink 1.9.1 , any solution to fix it
>>>> when the k8s redoploy jobmanager ,  the error looks like (seems zk not
>>>> remove submitted job info, but jobmanager remove the file):
>>>>
>>>>
>>>> Caused by: org.apache.flink.util.FlinkException: Could not retrieve
>>>> submitted JobGraph from state handle under
>>>> /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved
>>>> state
>>>> handle is broken. Try cleaning the state handle store.
>>>>         at
>>>>
>>>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
>>>>         at
>>>>
>>>> org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
>>>>         at
>>>>
>>>> org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
>>>>         at
>>>>
>>>> org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
>>>>         at
>>>>
>>>> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
>>>>         at
>>>>
>>>> org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
>>>>         ... 9 more
>>>> Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not
>>>> obtain
>>>> block: BP-1651346363-10.20.1.81-1525354906737:blk_1083182315_9441494
>>>> file=/flink/checkpoints/submittedJobGraph480ddf9572ed
>>>>         at
>>>>
>>>> org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1052)
>>>>
>>>>
>>>>
>>>> --
>>>> Sent from:
>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>>>>
>>>

Re: JobGraphs not cleaned up in HA mode

Reply via email to