Can you share the flink configuration once? Regards Bhaskar
On Thu, Nov 28, 2019 at 12:09 PM 曾祥才 <xcz200...@qq.com> wrote: > if i clean the zookeeper data , it runs fine . but next time when the > jobmanager failed and redeploy the error occurs again > > > > > ------------------ 原始邮件 ------------------ > *发件人:* "Vijay Bhaskar"<bhaskar.eba...@gmail.com>; > *发送时间:* 2019年11月28日(星期四) 下午3:05 > *收件人:* "曾祥才"<xcz200...@qq.com>; > *主题:* Re: JobGraphs not cleaned up in HA mode > > Again it could not find the state store file: "Caused by: > java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199 " > Check why its unable to find. > Better thing is: Clean up zookeeper state and check your configurations, > correct them and restart cluster. > Otherwise it always picks up corrupted state from zookeeper and it will > never restart > > Regards > Bhaskar > > On Thu, Nov 28, 2019 at 11:51 AM 曾祥才 <xcz200...@qq.com> wrote: > >> i've made a misstake( the log before is another cluster) . the full >> exception log is : >> >> >> INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - >> Recovering all persisted jobs. >> 2019-11-28 02:33:12,726 INFO >> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl - >> Starting the SlotManager. >> 2019-11-28 02:33:12,743 INFO >> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - >> Released locks of job graph 639170a9d710bacfd113ca66b2aacefa from >> ZooKeeper. >> 2019-11-28 02:33:12,744 ERROR >> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error >> occurred in the cluster entrypoint. >> org.apache.flink.runtime.dispatcher.DispatcherException: Failed to take >> leadership with session id 607e1fe0-f1b4-4af0-af16-4137d779defb. >> at >> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$30(Dispatcher.java:915) >> at >> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) >> at >> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) >> at >> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) >> at >> java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575) >> at >> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:594) >> at >> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) >> at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) >> at >> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44) >> at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >> at >> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) >> at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >> at >> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >> Caused by: java.lang.RuntimeException: >> org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph >> from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates >> that the retrieved state handle is broken. Try cleaning the state handle >> store. >> at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199) >> at >> org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75) >> at >> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616) >> at >> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) >> ... 7 more >> Caused by: org.apache.flink.util.FlinkException: Could not retrieve >> submitted JobGraph from state handle under >> /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state >> handle is broken. Try cleaning the state handle store. >> at >> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:190) >> at >> org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:755) >> at >> org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:740) >> at >> org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:721) >> at >> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$27(Dispatcher.java:888) >> at >> org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:73) >> ... 9 more >> Caused by: java.io.FileNotFoundException: >> /flink/ha/submittedJobGraph0c6bcff01199 (No such file or directory) >> at java.io.FileInputStream.open0(Native Method) >> at java.io.FileInputStream.open(FileInputStream.java:195) >> >> >> >> >> ------------------ 原始邮件 ------------------ >> *发件人:* "Vijay Bhaskar"<bhaskar.eba...@gmail.com>; >> *发送时间:* 2019年11月28日(星期四) 下午2:46 >> *收件人:* "曾祥才"<xcz200...@qq.com>; >> *抄送:* "User-Flink"<user@flink.apache.org>; >> *主题:* Re: JobGraphs not cleaned up in HA mode >> >> Is it filesystem or hadoop? If its NAS then why the exception "Caused by: >> org.apache.hadoop.hdfs.BlockMissingException: " >> It seems you configured hadoop state store and giving NAS mount. >> >> Regards >> Bhaskar >> >> >> >> On Thu, Nov 28, 2019 at 11:36 AM 曾祥才 <xcz200...@qq.com> wrote: >> >>> /flink/checkpoints is a external persistent store (a nas directory >>> mounts to the job manager) >>> >>> >>> >>> >>> ------------------ 原始邮件 ------------------ >>> *发件人:* "Vijay Bhaskar"<bhaskar.eba...@gmail.com>; >>> *发送时间:* 2019年11月28日(星期四) 下午2:29 >>> *收件人:* "曾祥才"<xcz200...@qq.com>; >>> *抄送:* "user"<user@flink.apache.org>; >>> *主题:* Re: JobGraphs not cleaned up in HA mode >>> >>> Following are the mandatory condition to run in HA: >>> >>> a) You should have persistent common external store for jobmanager and >>> task managers to while writing the state >>> b) You should have persistent external store for zookeeper to store the >>> Jobgraph. >>> >>> Zookeeper is referring path: >>> /flink/checkpoints/submittedJobGraph480ddf9572ed to get the job graph but >>> jobmanager unable to find it. >>> It seems /flink/checkpoints is not the external persistent store >>> >>> >>> Regards >>> Bhaskar >>> >>> On Thu, Nov 28, 2019 at 10:43 AM seuzxc <xcz200...@qq.com> wrote: >>> >>>> hi ,I've the same problem with flink 1.9.1 , any solution to fix it >>>> when the k8s redoploy jobmanager , the error looks like (seems zk not >>>> remove submitted job info, but jobmanager remove the file): >>>> >>>> >>>> Caused by: org.apache.flink.util.FlinkException: Could not retrieve >>>> submitted JobGraph from state handle under >>>> /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved >>>> state >>>> handle is broken. Try cleaning the state handle store. >>>> at >>>> >>>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208) >>>> at >>>> >>>> org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696) >>>> at >>>> >>>> org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681) >>>> at >>>> >>>> org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662) >>>> at >>>> >>>> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821) >>>> at >>>> >>>> org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72) >>>> ... 9 more >>>> Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not >>>> obtain >>>> block: BP-1651346363-10.20.1.81-1525354906737:blk_1083182315_9441494 >>>> file=/flink/checkpoints/submittedJobGraph480ddf9572ed >>>> at >>>> >>>> org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1052) >>>> >>>> >>>> >>>> -- >>>> Sent from: >>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ >>>> >>>