Hi,

Why do you not use HDFS directly?
Best,
Vino

On Thu, Nov 28, 2019 at 6:48 PM 曾祥才 <xcz200...@qq.com> wrote:

> Anyone have the same problem? Please help, thanks.
>
> ------------------ Original Message ------------------
> *From:* "曾祥才" <xcz200...@qq.com>
> *Sent:* Thursday, Nov 28, 2019, 2:46 PM
> *To:* "Vijay Bhaskar" <bhaskar.eba...@gmail.com>
> *Cc:* "User-Flink" <user@flink.apache.org>
> *Subject:* Re: JobGraphs not cleaned up in HA mode
>
> The config (/flink is the NAS directory):
>
> jobmanager.rpc.address: flink-jobmanager
> taskmanager.numberOfTaskSlots: 16
> web.upload.dir: /flink/webUpload
> blob.server.port: 6124
> jobmanager.rpc.port: 6123
> taskmanager.rpc.port: 6122
> jobmanager.heap.size: 1024m
> taskmanager.heap.size: 1024m
> high-availability: zookeeper
> high-availability.cluster-id: /cluster-test
> high-availability.storageDir: /flink/ha
> high-availability.zookeeper.quorum: ****:2181
> high-availability.jobmanager.port: 6123
> high-availability.zookeeper.path.root: /flink/risk-insight
> high-availability.zookeeper.path.checkpoints: /flink/zk-checkpoints
> state.backend: filesystem
> state.checkpoints.dir: file:///flink/checkpoints
> state.savepoints.dir: file:///flink/savepoints
> state.checkpoints.num-retained: 2
> jobmanager.execution.failover-strategy: region
> jobmanager.archive.fs.dir: file:///flink/archive/history
>
> ------------------ Original Message ------------------
> *From:* "Vijay Bhaskar" <bhaskar.eba...@gmail.com>
> *Sent:* Thursday, Nov 28, 2019, 3:12 PM
> *To:* "曾祥才" <xcz200...@qq.com>
> *Cc:* "User-Flink" <user@flink.apache.org>
> *Subject:* Re: JobGraphs not cleaned up in HA mode
>
> Can you share the flink configuration once?
>
> Regards
> Bhaskar
>
> On Thu, Nov 28, 2019 at 12:09 PM 曾祥才 <xcz200...@qq.com> wrote:
>
>> If I clean the zookeeper data, it runs fine,
>> but the next time the jobmanager fails and is redeployed, the error
>> occurs again.
>>
>> ------------------ Original Message ------------------
>> *From:* "Vijay Bhaskar" <bhaskar.eba...@gmail.com>
>> *Sent:* Thursday, Nov 28, 2019, 3:05 PM
>> *To:* "曾祥才" <xcz200...@qq.com>
>> *Subject:* Re: JobGraphs not cleaned up in HA mode
>>
>> Again it could not find the state store file: "Caused by:
>> java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199".
>> Check why it is unable to find it.
>> The better approach is: clean up the zookeeper state, check your
>> configurations, correct them, and restart the cluster.
>> Otherwise it always picks up the corrupted state from zookeeper and will
>> never restart.
>>
>> Regards
>> Bhaskar
>>
>> On Thu, Nov 28, 2019 at 11:51 AM 曾祥才 <xcz200...@qq.com> wrote:
>>
>>> I made a mistake (the earlier log is from another cluster). The full
>>> exception log is:
>>>
>>> INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Recovering all persisted jobs.
>>> 2019-11-28 02:33:12,726 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl - Starting the SlotManager.
>>> 2019-11-28 02:33:12,743 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Released locks of job graph 639170a9d710bacfd113ca66b2aacefa from ZooKeeper.
>>> 2019-11-28 02:33:12,744 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
>>> org.apache.flink.runtime.dispatcher.DispatcherException: Failed to take leadership with session id 607e1fe0-f1b4-4af0-af16-4137d779defb.
>>> at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$30(Dispatcher.java:915)
>>> at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>>> at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>>> at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>>> at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575)
>>> at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:594)
>>> at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
>>> at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
>>> at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
>>> at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>> at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>> at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>> at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>> Caused by: java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>>> at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
>>> at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75)
>>> at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
>>> at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
>>> ... 7 more
>>> Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>>> at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:190)
>>> at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:755)
>>> at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:740)
>>> at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:721)
>>> at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$27(Dispatcher.java:888)
>>> at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:73)
>>> ... 9 more
>>> Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199 (No such file or directory)
>>> at java.io.FileInputStream.open0(Native Method)
>>> at java.io.FileInputStream.open(FileInputStream.java:195)
>>>
>>> ------------------ Original Message ------------------
>>> *From:* "Vijay Bhaskar" <bhaskar.eba...@gmail.com>
>>> *Sent:* Thursday, Nov 28, 2019, 2:46 PM
>>> *To:* "曾祥才" <xcz200...@qq.com>
>>> *Cc:* "User-Flink" <user@flink.apache.org>
>>> *Subject:* Re: JobGraphs not cleaned up in HA mode
>>>
>>> Is it filesystem or hadoop? If it is NAS, then why the exception
>>> "Caused by: org.apache.hadoop.hdfs.BlockMissingException:"?
>>> It seems you configured a hadoop state store but are giving it a NAS
>>> mount.
>>>
>>> Regards
>>> Bhaskar
>>>
>>> On Thu, Nov 28, 2019 at 11:36 AM 曾祥才 <xcz200...@qq.com> wrote:
>>>
>>>> /flink/checkpoints is an external persistent store (a NAS directory
>>>> mounted to the job manager).
>>>>
>>>> ------------------ Original Message ------------------
>>>> *From:* "Vijay Bhaskar" <bhaskar.eba...@gmail.com>
>>>> *Sent:* Thursday, Nov 28, 2019, 2:29 PM
>>>> *To:* "曾祥才" <xcz200...@qq.com>
>>>> *Cc:* "user" <user@flink.apache.org>
>>>> *Subject:* Re: JobGraphs not cleaned up in HA mode
>>>>
>>>> The following are the mandatory conditions to run in HA:
>>>>
>>>> a) You should have a persistent common external store that the
>>>> jobmanager and task managers write their state to.
>>>> b) You should have a persistent external store for zookeeper to store
>>>> the JobGraph.
>>>>
>>>> Zookeeper refers to the path
>>>> /flink/checkpoints/submittedJobGraph480ddf9572ed to get the job graph,
>>>> but the jobmanager is unable to find it.
>>>> It seems /flink/checkpoints is not an external persistent store.
>>>>
>>>> Regards
>>>> Bhaskar
>>>>
>>>> On Thu, Nov 28, 2019 at 10:43 AM seuzxc <xcz200...@qq.com> wrote:
>>>>
>>>>> Hi, I have the same problem with flink 1.9.1; any solution to fix it?
>>>>> When k8s redeploys the jobmanager, the error looks like the following
>>>>> (it seems zk does not remove the submitted job info, but the
>>>>> jobmanager removed the file):
>>>>>
>>>>> Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>>>>> at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
>>>>> at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
>>>>> at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
>>>>> at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
>>>>> at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
>>>>> at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
>>>>> ... 9 more
>>>>> Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1651346363-10.20.1.81-1525354906737:blk_1083182315_9441494 file=/flink/checkpoints/submittedJobGraph480ddf9572ed
>>>>> at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1052)
>>>>>
>>>>> --
>>>>> Sent from:
>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
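For reference, a minimal sketch of the HA-related settings implied by Bhaskar's two conditions, assuming `/flink` is a mount visible at the same path on every JobManager and TaskManager pod; the `hdfs://` alternatives and the `zk-host` quorum address are placeholders, not values from the thread:

```yaml
# HA state must live on storage that a *restarted* JobManager can still read.
# With NAS, the same mount must exist at the same path in every pod;
# with HDFS (as Vino suggests), use hdfs:// URIs instead of file://.
high-availability: zookeeper
high-availability.cluster-id: /cluster-test
high-availability.zookeeper.quorum: zk-host:2181       # placeholder
# The submittedJobGraph* and completedCheckpoint* files written here
# must survive pod restarts, or recovery hits FileNotFoundException:
high-availability.storageDir: file:///flink/ha          # or hdfs:///flink/ha
state.backend: filesystem
state.checkpoints.dir: file:///flink/checkpoints        # or hdfs:///flink/checkpoints
state.savepoints.dir: file:///flink/savepoints          # or hdfs:///flink/savepoints
```

Note that the poster's original config used a bare `/flink/ha` (no scheme) for `high-availability.storageDir`; if a Hadoop filesystem is the configured default, that path may resolve to HDFS rather than the NAS mount, which would explain the `BlockMissingException` Bhaskar points out.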
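The failure mode the thread converges on can be sketched as a toy model: ZooKeeper keeps only a small state handle that points at a job graph file written to `high-availability.storageDir`, so if that file vanishes (for example, a non-shared local mount on a freshly scheduled pod) while the ZooKeeper node survives, recovery fails on every restart until the stale entry is removed. The class and method names below are invented for illustration and are not Flink's actual implementation:

```python
import os
import tempfile

class ToyJobGraphStore:
    """Toy stand-in for a ZooKeeper-backed job graph store (illustrative only)."""

    def __init__(self, storage_dir):
        self.storage_dir = storage_dir
        self.zk_handles = {}  # stands in for the ZooKeeper job-graph nodes

    def put_job_graph(self, job_id, payload):
        # The payload goes to the shared storage dir; ZooKeeper keeps a pointer.
        path = os.path.join(self.storage_dir, "submittedJobGraph-" + job_id)
        with open(path, "w") as f:
            f.write(payload)
        self.zk_handles[job_id] = path

    def recover_job_graph(self, job_id):
        # Resolve the handle from "ZooKeeper", then read the file it points to;
        # raises FileNotFoundError if the file is gone, like the log above.
        path = self.zk_handles[job_id]
        with open(path) as f:
            return f.read()

store = ToyJobGraphStore(tempfile.mkdtemp())
store.put_job_graph("639170a9", "serialized-job-graph")
assert store.recover_job_graph("639170a9") == "serialized-job-graph"

# Simulate the reported situation: the file disappears while the
# ZooKeeper pointer survives, so recovery fails -- and keeps failing
# on every redeploy until the stale ZooKeeper entry is cleaned up.
os.remove(store.zk_handles["639170a9"])
try:
    store.recover_job_graph("639170a9")
    recovered = True
except FileNotFoundError:
    recovered = False
assert not recovered
```

This is why cleaning only the file side (or only the ZooKeeper side) leaves the cluster stuck: the pointer and the file must be removed, or both must survive, together.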