Re: JobGraphs not cleaned up in HA mode

Wed, 27 Nov 2019 22:51:08 -0800

The config (/flink is the NAS directory):


jobmanager.rpc.address: flink-jobmanager
taskmanager.numberOfTaskSlots: 16
web.upload.dir: /flink/webUpload
blob.server.port: 6124
jobmanager.rpc.port: 6123
taskmanager.rpc.port: 6122
jobmanager.heap.size: 1024m
taskmanager.heap.size: 1024m
high-availability: zookeeper
high-availability.cluster-id: /cluster-test
high-availability.storageDir: /flink/ha
high-availability.zookeeper.quorum: ****:2181
high-availability.jobmanager.port: 6123
high-availability.zookeeper.path.root: /flink/risk-insight
high-availability.zookeeper.path.checkpoints: /flink/zk-checkpoints
state.backend: filesystem
state.checkpoints.dir: file:///flink/checkpoints
state.savepoints.dir: file:///flink/savepoints
state.checkpoints.num-retained: 2
jobmanager.execution.failover-strategy: region
jobmanager.archive.fs.dir: file:///flink/archive/history
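
A configuration like the one above can be sanity-checked with a short script. The sketch below parses flat `key: value` lines and lists the storage options whose value has no URI scheme. The helper names (`parse_flink_conf`, `paths_without_scheme`) and the list of keys checked are assumptions made for illustration, not part of Flink. To my understanding, Flink resolves scheme-less paths against its default filesystem scheme (`fs.default-scheme`), so writing `file://` explicitly removes one source of ambiguity.

```python
from urllib.parse import urlparse

# Keys whose values are storage locations (assumed list, for illustration only).
STORAGE_KEYS = (
    "high-availability.storageDir",
    "state.checkpoints.dir",
    "state.savepoints.dir",
    "jobmanager.archive.fs.dir",
)

def parse_flink_conf(text):
    """Parse flat 'key: value' lines; skip blank lines and '#' comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(": ")
        if sep:
            conf[key.strip()] = value.strip()
    return conf

def paths_without_scheme(conf):
    """Return storage keys whose value has no URI scheme (file://, hdfs://, ...)."""
    return [k for k in STORAGE_KEYS if k in conf and not urlparse(conf[k]).scheme]

sample = """\
high-availability.storageDir: /flink/ha
state.checkpoints.dir: file:///flink/checkpoints
state.savepoints.dir: file:///flink/savepoints
jobmanager.archive.fs.dir: file:///flink/archive/history
"""
print(paths_without_scheme(parse_flink_conf(sample)))
```

With the values from this thread, only `high-availability.storageDir` is reported, since it is the one storage path given without `file://`.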





------------------ Original Message ------------------
From: "Vijay Bhaskar" <bhaskar.eba...@gmail.com>;
Sent: Thursday, Nov 28, 2019, 3:12 PM
To: <xcz200...@qq.com>;
Cc: "User-Flink" <user@flink.apache.org>;
Subject: Re: JobGraphs not cleaned up in HA mode



Can you share the Flink configuration?

Regards
Bhaskar


On Thu, Nov 28, 2019 at 12:09 PM <xcz200...@qq.com> wrote:

If I clean the ZooKeeper data, it runs fine. But the next time the
JobManager fails and is redeployed, the error occurs again.








------------------ Original Message ------------------
From: "Vijay Bhaskar" <bhaskar.eba...@gmail.com>;
Sent: Thursday, Nov 28, 2019, 3:05 PM
To: <xcz200...@qq.com>;
Subject: Re: JobGraphs not cleaned up in HA mode



Again, it could not find the state store file: "Caused by:
java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199".
Check why it is unable to find the file.
The better approach is to clean up the ZooKeeper state, check and correct
your configuration, and restart the cluster.
Otherwise it will always pick up the corrupted state from ZooKeeper and
never restart.
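
Why stale ZooKeeper entries keep the cluster from restarting can be sketched as follows: each job graph entry in ZooKeeper points at a file under the HA storage directory, and recovery fails as soon as one pointer has no file behind it. The function `find_dangling_handles` and its input mapping are hypothetical; reading the real pointers out of ZooKeeper is not shown here.

```python
import os
import tempfile

def find_dangling_handles(handles):
    """Return job ids whose job graph file is missing on disk.

    `handles` maps a job id to the file path its ZooKeeper entry points at
    (hypothetical input; fetching it from ZooKeeper is out of scope here).
    """
    return sorted(
        job_id for job_id, path in handles.items() if not os.path.isfile(path)
    )

# Demo: a temporary directory stands in for the HA storage directory.
with tempfile.TemporaryDirectory() as ha_dir:
    present = os.path.join(ha_dir, "submittedJobGraphPresent")
    open(present, "wb").close()  # create an empty placeholder file
    handles = {
        "job-a": present,
        "job-b": os.path.join(ha_dir, "submittedJobGraphMissing"),
    }
    dangling = find_dangling_handles(handles)
    print(dangling)
```

Any job id reported this way is an entry that would have to be cleaned up, or its file restored, before the JobManager can recover.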


Regards
Bhaskar


On Thu, Nov 28, 2019 at 11:51 AM <xcz200...@qq.com> wrote:

I made a mistake (the earlier log was from another cluster). The full
exception log is:


INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Recovering all persisted jobs.
2019-11-28 02:33:12,726 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl  - Starting the SlotManager.
2019-11-28 02:33:12,743 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Released locks of job graph 639170a9d710bacfd113ca66b2aacefa from ZooKeeper.
2019-11-28 02:33:12,744 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Fatal error occurred in the cluster entrypoint.
org.apache.flink.runtime.dispatcher.DispatcherException: Failed to take leadership with session id 607e1fe0-f1b4-4af0-af16-4137d779defb.
        at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$30(Dispatcher.java:915)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575)
        at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:594)
        at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
        at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
        at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75)
        at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
        at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
        ... 7 more
Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
        at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:190)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:755)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:740)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:721)
        at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$27(Dispatcher.java:888)
        at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:73)
        ... 9 more
Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199 (No such file or directory)
        at java.io.FileInputStream.open0(Native Method)
        at java.io.FileInputStream.open(FileInputStream.java:195)







------------------ Original Message ------------------
From: "Vijay Bhaskar" <bhaskar.eba...@gmail.com>;
Sent: Thursday, Nov 28, 2019, 2:46 PM
To: <xcz200...@qq.com>;
Cc: "User-Flink" <user@flink.apache.org>;
Subject: Re: JobGraphs not cleaned up in HA mode



Is it a filesystem or Hadoop? If it is NAS, then why the exception "Caused by:
org.apache.hadoop.hdfs.BlockMissingException"? It seems you configured a
Hadoop state store but are giving it a NAS mount.


Regards
Bhaskar




On Thu, Nov 28, 2019 at 11:36 AM <xcz200...@qq.com> wrote:

/flink/checkpoints is an external persistent store (a NAS directory
mounted on the JobManager).








------------------ Original Message ------------------
From: "Vijay Bhaskar" <bhaskar.eba...@gmail.com>;
Sent: Thursday, Nov 28, 2019, 2:29 PM
To: <xcz200...@qq.com>;
Cc: "user" <user@flink.apache.org>;
Subject: Re: JobGraphs not cleaned up in HA mode



The following conditions are mandatory to run in HA:

a) You should have a persistent, common external store that the JobManager
and TaskManagers use when writing state.
b) You should have a persistent external store for ZooKeeper to store the
JobGraph.


ZooKeeper refers to the path
/flink/checkpoints/submittedJobGraph480ddf9572ed to get the job graph, but
the JobManager is unable to find it.
It seems /flink/checkpoints is not the external persistent store.
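
Condition (a) can be partially checked from inside each JobManager and TaskManager container. The following is a minimal sketch, assuming the store is a locally mounted directory such as a NAS mount. `check_store` is a hypothetical helper, and its checks are heuristics rather than proof that the same volume is mounted everywhere.

```python
import os
import tempfile

def check_store(path):
    """Report basic facts about a directory that is meant to be shared storage."""
    is_dir = os.path.isdir(path)
    return {
        "exists": is_dir,
        # True only if `path` itself is a mount point, not a subdirectory of one.
        "is_mount": os.path.ismount(path),
        "writable": is_dir and os.access(path, os.W_OK),
    }

# Demo on a scratch directory; on a real pod one would pass the mount path instead.
with tempfile.TemporaryDirectory() as scratch:
    report = check_store(scratch)
    print(report)
```

Running the same check on every pod and comparing the results gives a first hint of whether a path such as /flink is backed by the mounted volume or by the container's own ephemeral filesystem.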




Regards
Bhaskar


On Thu, Nov 28, 2019 at 10:43 AM seuzxc <xcz200...@qq.com> wrote:

Hi, I have the same problem with Flink 1.9.1. Is there any solution to fix it?
When k8s redeploys the JobManager, the error looks like this (it seems ZK does
not remove the submitted job info, but the JobManager removes the file):
 
 
Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
        at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
        at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
        at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
        ... 9 more
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1651346363-10.20.1.81-1525354906737:blk_1083182315_9441494 file=/flink/checkpoints/submittedJobGraph480ddf9572ed
        at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1052)
 
 
 
 --
 Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
