the config (/flink is the NASdirectory ):
jobmanager.rpc.address: flink-jobmanager taskmanager.numberOfTaskSlots: 16 web.upload.dir: /flink/webUpload blob.server.port: 6124 jobmanager.rpc.port: 6123 taskmanager.rpc.port: 6122 jobmanager.heap.size: 1024m taskmanager.heap.size: 1024m high-availability: zookeeper high-availability.cluster-id: /cluster-test high-availability.storageDir: /flink/ha high-availability.zookeeper.quorum: ****:2181 high-availability.jobmanager.port: 6123 high-availability.zookeeper.path.root: /flink/risk-insight high-availability.zookeeper.path.checkpoints: /flink/zk-checkpoints state.backend: filesystem state.checkpoints.dir: file:///flink/checkpoints state.savepoints.dir: file:///flink/savepoints state.checkpoints.num-retained: 2 jobmanager.execution.failover-strategy: region jobmanager.archive.fs.dir: file:///flink/archive/history ------------------ ???????? ------------------ ??????: "Vijay Bhaskar"<bhaskar.eba...@gmail.com>; ????????: 2019??11??28??(??????) ????3:12 ??????: "??????"<xcz200...@qq.com>; ????: "User-Flink"<user@flink.apache.org>; ????: Re: JobGraphs not cleaned up in HA mode Can you share the flink configuration once? Regards Bhaskar On Thu, Nov 28, 2019 at 12:09 PM ?????? <xcz200...@qq.com> wrote: if i clean the zookeeper data , it runs fine . but next time when the jobmanager failed and redeploy the error occurs again ------------------ ???????? ------------------ ??????: "Vijay Bhaskar"<bhaskar.eba...@gmail.com>; ????????: 2019??11??28??(??????) ????3:05 ??????: "??????"<xcz200...@qq.com>; ????: Re: JobGraphs not cleaned up in HA mode Again it could not find the state store file: "Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199 " Check why its unable to find. Better thing is: Clean up zookeeper state and check your configurations, correct them and restart cluster. Otherwise it always picks up corrupted state from zookeeper and it will never restart Regards Bhaskar On Thu, Nov 28, 2019 at 11:51 AM ?????? <xcz200...@qq.com> wrote: i've made a misstake( the log before is another cluster) . the full exception log is : INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Recovering all persisted jobs. 2019-11-28 02:33:12,726 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl - Starting the SlotManager. 2019-11-28 02:33:12,743 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Released locks of job graph 639170a9d710bacfd113ca66b2aacefa from ZooKeeper. 2019-11-28 02:33:12,744 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint. org.apache.flink.runtime.dispatcher.DispatcherException: Failed to take leadership with session id 607e1fe0-f1b4-4af0-af16-4137d779defb. at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$30(Dispatcher.java:915) at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575) at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:594) at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44) at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store. at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199) at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75) at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616) at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) ... 7 more Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store. at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:190) at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:755) at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:740) at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:721) at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$27(Dispatcher.java:888) at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:73) ... 9 more Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199 (No such file or directory) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) ------------------ ???????? ------------------ ??????: "Vijay Bhaskar"<bhaskar.eba...@gmail.com>; ????????: 2019??11??28??(??????) ????2:46 ??????: "??????"<xcz200...@qq.com>; ????: "User-Flink"<user@flink.apache.org>; ????: Re: JobGraphs not cleaned up in HA mode Is it filesystem or hadoop? If its NAS then why the exception "Caused by: org.apache.hadoop.hdfs.BlockMissingException: "It seems you configured hadoop state store and giving NAS mount. Regards Bhaskar On Thu, Nov 28, 2019 at 11:36 AM ?????? <xcz200...@qq.com> wrote: /flink/checkpoints is a external persistent store (a nas directory mounts to the job manager) ------------------ ???????? ------------------ ??????: "Vijay Bhaskar"<bhaskar.eba...@gmail.com>; ????????: 2019??11??28??(??????) ????2:29 ??????: "??????"<xcz200...@qq.com>; ????: "user"<user@flink.apache.org>; ????: Re: JobGraphs not cleaned up in HA mode Following are the mandatory condition to run in HA: a) You should have persistent common external store for jobmanager and task managers to while writing the state b) You should have persistent external store for zookeeper to store the Jobgraph. Zookeeper is referring path: /flink/checkpoints/submittedJobGraph480ddf9572ed to get the job graph but jobmanager unable to find it. It seems /flink/checkpoints is not the external persistent store Regards Bhaskar On Thu, Nov 28, 2019 at 10:43 AM seuzxc <xcz200...@qq.com> wrote: hi ??I've the same problem with flink 1.9.1 , any solution to fix it when the k8s redoploy jobmanager , the error looks like (seems zk not remove submitted job info, but jobmanager remove the file): Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved state handle is broken. Try cleaning the state handle store. at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208) at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696) at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681) at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662) at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821) at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72) ... 9 more Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1651346363-10.20.1.81-1525354906737:blk_1083182315_9441494 file=/flink/checkpoints/submittedJobGraph480ddf9572ed at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1052) -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/