Hi,

Why do you not use HDFS directly?
Best,
Vino

On Thu, Nov 28, 2019 at 6:48 PM 曾祥才 <xcz200...@qq.com> wrote:

> Anyone have the same problem? Please help, thanks.
>
> ------------------ Original Message ------------------
> *From:* "曾祥才" <xcz200...@qq.com>
> *Sent:* Thursday, Nov 28, 2019, 2:46 PM
> *To:* "Vijay Bhaskar" <bhaskar.eba...@gmail.com>
> *Cc:* "User-Flink" <user@flink.apache.org>
> *Subject:* Re: JobGraphs not cleaned up in HA mode
>
> The config (/flink is the NAS directory):
>
> jobmanager.rpc.address: flink-jobmanager
> taskmanager.numberOfTaskSlots: 16
> web.upload.dir: /flink/webUpload
> blob.server.port: 6124
> jobmanager.rpc.port: 6123
> taskmanager.rpc.port: 6122
> jobmanager.heap.size: 1024m
> taskmanager.heap.size: 1024m
> high-availability: zookeeper
> high-availability.cluster-id: /cluster-test
> high-availability.storageDir: /flink/ha
> high-availability.zookeeper.quorum: ****:2181
> high-availability.jobmanager.port: 6123
> high-availability.zookeeper.path.root: /flink/risk-insight
> high-availability.zookeeper.path.checkpoints: /flink/zk-checkpoints
> state.backend: filesystem
> state.checkpoints.dir: file:///flink/checkpoints
> state.savepoints.dir: file:///flink/savepoints
> state.checkpoints.num-retained: 2
> jobmanager.execution.failover-strategy: region
> jobmanager.archive.fs.dir: file:///flink/archive/history
>
> ------------------ Original Message ------------------
> *From:* "Vijay Bhaskar" <bhaskar.eba...@gmail.com>
> *Sent:* Thursday, Nov 28, 2019, 3:12 PM
> *To:* "曾祥才" <xcz200...@qq.com>
> *Cc:* "User-Flink" <user@flink.apache.org>
> *Subject:* Re: JobGraphs not cleaned up in HA mode
>
> Can you share the flink configuration once?
>
> Regards
> Bhaskar
>
> On Thu, Nov 28, 2019 at 12:09 PM 曾祥才 <xcz200...@qq.com> wrote:
>
>> If I clean the zookeeper data, it runs fine,
>> but the next time the jobmanager fails and is redeployed, the error
>> occurs again.
>>
>> ------------------ Original Message ------------------
>> *From:* "Vijay Bhaskar" <bhaskar.eba...@gmail.com>
>> *Sent:* Thursday, Nov 28, 2019, 3:05 PM
>> *To:* "曾祥才" <xcz200...@qq.com>
>> *Subject:* Re: JobGraphs not cleaned up in HA mode
>>
>> Again it could not find the state store file: "Caused by:
>> java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199".
>> Check why it is unable to find it.
>> The better approach is: clean up the zookeeper state, check your
>> configurations, correct them, and restart the cluster.
>> Otherwise it always picks up the corrupted state from zookeeper and will
>> never restart.
>>
>> Regards
>> Bhaskar
>>
>> On Thu, Nov 28, 2019 at 11:51 AM 曾祥才 <xcz200...@qq.com> wrote:
>>
>>> I made a mistake (the earlier log is from another cluster). The full
>>> exception log is:
>>>
>>> INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Recovering all persisted jobs.
>>> 2019-11-28 02:33:12,726 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl - Starting the SlotManager.
>>> 2019-11-28 02:33:12,743 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Released locks of job graph 639170a9d710bacfd113ca66b2aacefa from ZooKeeper.
>>> 2019-11-28 02:33:12,744 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
>>> org.apache.flink.runtime.dispatcher.DispatcherException: Failed to take leadership with session id 607e1fe0-f1b4-4af0-af16-4137d779defb.
>>> at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$30(Dispatcher.java:915)
>>> at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>>> at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>>> at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>>> at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575)
>>> at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:594)
>>> at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
>>> at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
>>> at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
>>> at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>> at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>> at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>> at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>> Caused by: java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>>> at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
>>> at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75)
>>> at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
>>> at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
>>> ... 7 more
>>> Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>>> at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:190)
>>> at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:755)
>>> at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:740)
>>> at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:721)
>>> at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$27(Dispatcher.java:888)
>>> at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:73)
>>> ... 9 more
>>> Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199 (No such file or directory)
>>> at java.io.FileInputStream.open0(Native Method)
>>> at java.io.FileInputStream.open(FileInputStream.java:195)
>>>
>>> ------------------ Original Message ------------------
>>> *From:* "Vijay Bhaskar" <bhaskar.eba...@gmail.com>
>>> *Sent:* Thursday, Nov 28, 2019, 2:46 PM
>>> *To:* "曾祥才" <xcz200...@qq.com>
>>> *Cc:* "User-Flink" <user@flink.apache.org>
>>> *Subject:* Re: JobGraphs not cleaned up in HA mode
>>>
>>> Is it filesystem or hadoop? If it is NAS, then why the exception
>>> "Caused by: org.apache.hadoop.hdfs.BlockMissingException:"?
>>> It seems you configured a hadoop state store but are giving it a NAS
>>> mount.
>>>
>>> Regards
>>> Bhaskar
>>>
>>> On Thu, Nov 28, 2019 at 11:36 AM 曾祥才 <xcz200...@qq.com> wrote:
>>>
>>>> /flink/checkpoints is an external persistent store (a NAS directory
>>>> mounted to the job manager).
>>>>
>>>> ------------------ Original Message ------------------
>>>> *From:* "Vijay Bhaskar" <bhaskar.eba...@gmail.com>
>>>> *Sent:* Thursday, Nov 28, 2019, 2:29 PM
>>>> *To:* "曾祥才" <xcz200...@qq.com>
>>>> *Cc:* "user" <user@flink.apache.org>
>>>> *Subject:* Re: JobGraphs not cleaned up in HA mode
>>>>
>>>> The following are the mandatory conditions to run in HA:
>>>>
>>>> a) You should have a persistent common external store that the
>>>> jobmanager and task managers write their state to.
>>>> b) You should have a persistent external store for zookeeper to store
>>>> the JobGraph.
>>>>
>>>> Zookeeper refers to the path
>>>> /flink/checkpoints/submittedJobGraph480ddf9572ed to get the job graph,
>>>> but the jobmanager is unable to find it.
>>>> It seems /flink/checkpoints is not an external persistent store.
>>>>
>>>> Regards
>>>> Bhaskar
>>>>
>>>> On Thu, Nov 28, 2019 at 10:43 AM seuzxc <xcz200...@qq.com> wrote:
>>>>
>>>>> Hi, I have the same problem with flink 1.9.1; any solution to fix it?
>>>>> When k8s redeploys the jobmanager, the error looks like the following
>>>>> (it seems zk does not remove the submitted job info, but the
>>>>> jobmanager removed the file):
>>>>>
>>>>> Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>>>>> at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
>>>>> at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
>>>>> at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
>>>>> at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
>>>>> at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
>>>>> at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
>>>>> ... 9 more
>>>>> Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1651346363-10.20.1.81-1525354906737:blk_1083182315_9441494 file=/flink/checkpoints/submittedJobGraph480ddf9572ed
>>>>> at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1052)
>>>>>
>>>>> --
>>>>> Sent from:
>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
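For reference, a minimal sketch of the HA-related settings implied by Bhaskar's two conditions, assuming `/flink` is a mount visible at the same path on every JobManager and TaskManager pod; the `hdfs://` alternatives and the `zk-host` quorum address are placeholders, not values from the thread:

```yaml
# HA state must live on storage that a *restarted* JobManager can still read.
# With NAS, the same mount must exist at the same path in every pod;
# with HDFS (as Vino suggests), use hdfs:// URIs instead of file://.
high-availability: zookeeper
high-availability.cluster-id: /cluster-test
high-availability.zookeeper.quorum: zk-host:2181       # placeholder
# The submittedJobGraph* and completedCheckpoint* files written here
# must survive pod restarts, or recovery hits FileNotFoundException:
high-availability.storageDir: file:///flink/ha          # or hdfs:///flink/ha
state.backend: filesystem
state.checkpoints.dir: file:///flink/checkpoints        # or hdfs:///flink/checkpoints
state.savepoints.dir: file:///flink/savepoints          # or hdfs:///flink/savepoints
```

Note that the poster's original config used a bare `/flink/ha` (no scheme) for `high-availability.storageDir`; if a Hadoop filesystem is the configured default, that path may resolve to HDFS rather than the NAS mount, which would explain the `BlockMissingException` Bhaskar points out.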
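The failure mode the thread converges on can be sketched as a toy model: ZooKeeper keeps only a small state handle that points at a job graph file written to `high-availability.storageDir`, so if that file vanishes (for example, a non-shared local mount on a freshly scheduled pod) while the ZooKeeper node survives, recovery fails on every restart until the stale entry is removed. The class and method names below are invented for illustration and are not Flink's actual implementation:

```python
import os
import tempfile

class ToyJobGraphStore:
    """Toy stand-in for a ZooKeeper-backed job graph store (illustrative only)."""

    def __init__(self, storage_dir):
        self.storage_dir = storage_dir
        self.zk_handles = {}  # stands in for the ZooKeeper job-graph nodes

    def put_job_graph(self, job_id, payload):
        # The payload goes to the shared storage dir; ZooKeeper keeps a pointer.
        path = os.path.join(self.storage_dir, "submittedJobGraph-" + job_id)
        with open(path, "w") as f:
            f.write(payload)
        self.zk_handles[job_id] = path

    def recover_job_graph(self, job_id):
        # Resolve the handle from "ZooKeeper", then read the file it points to;
        # raises FileNotFoundError if the file is gone, like the log above.
        path = self.zk_handles[job_id]
        with open(path) as f:
            return f.read()

store = ToyJobGraphStore(tempfile.mkdtemp())
store.put_job_graph("639170a9", "serialized-job-graph")
assert store.recover_job_graph("639170a9") == "serialized-job-graph"

# Simulate the reported situation: the file disappears while the
# ZooKeeper pointer survives, so recovery fails -- and keeps failing
# on every redeploy until the stale ZooKeeper entry is cleaned up.
os.remove(store.zk_handles["639170a9"])
try:
    store.recover_job_graph("639170a9")
    recovered = True
except FileNotFoundError:
    recovered = False
assert not recovered
```

This is why cleaning only the file side (or only the ZooKeeper side) leaves the cluster stuck: the pointer and the file must be removed, or both must survive, together.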