Hi Miki,

it looks as if you did not submit a job to the cluster whose logs you shared. At least I could not see a job submission call.
Cheers,
Till

On Mon, Jun 4, 2018 at 12:31 PM miki haiat <miko5...@gmail.com> wrote:

> Hi Till,
> I've managed to reproduce it.
> Full log: faild_jm.log
> <https://gist.githubusercontent.com/miko-code/e634164404354c4c590be84292fd8cb2/raw/baeee310cd50cfa79303b328e3334d960c8e98e6/faild_jm.log>
>
> On Mon, Jun 4, 2018 at 10:33 AM Till Rohrmann <trohrm...@apache.org> wrote:
>
>> Hmm, Flink should not delete the stored blobs on the HA storage. Could
>> you try to reproduce the problem and then send us the logs on DEBUG level?
>> Please also check, before shutting the cluster down, that the files were
>> there.
>>
>> Cheers,
>> Till
>>
>> On Sun, Jun 3, 2018 at 1:10 PM miki haiat <miko5...@gmail.com> wrote:
>>
>>> Hi Till,
>>>
>>> 1. The files no longer exist in HDFS.
>>> 2. Yes, I stopped and started the cluster with the bin commands.
>>> 3. Unfortunately I deleted the log. :(
>>>
>>> I wonder whether this code could cause the issue, i.e. the way I'm
>>> using checkpoints:
>>>
>>> StateBackend sb = new
>>> FsStateBackend("hdfs://***/flink/my_city/checkpoints");
>>> env.setStateBackend(sb);
>>> env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE);
>>> env.getCheckpointConfig().setCheckpointInterval(60000);
>>>
>>> On Fri, Jun 1, 2018 at 6:19 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>
>>>> Hi Miki,
>>>>
>>>> could you check whether the files are really no longer stored on HDFS?
>>>> How did you terminate the cluster? Simply calling `bin/stop-cluster.sh`?
>>>> I just tried it locally, and it could recover the job after calling
>>>> `bin/start-cluster.sh` again.
>>>>
>>>> What would be helpful are the logs from the initial run of the job. So
>>>> if you can reproduce the problem, then this log would be very helpful.
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Thu, May 31, 2018 at 6:14 PM, miki haiat <miko5...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm having a weird issue with JobManager recovery.
>>>>> I'm using HDFS and ZooKeeper for an HA standalone cluster.
>>>>>
>>>>> I stopped the cluster and changed some parameters in the Flink conf
>>>>> (memory). But now, when I start the cluster again, I get an error that
>>>>> prevents the JobManager from starting: somehow the checkpoint file
>>>>> doesn't exist in Hadoop, and the JobManager won't start.
>>>>>
>>>>> Full JM log file:
>>>>> <https://gist.github.com/miko-code/28d57b32cb9c4f1aa96fa9873e10e53c>
>>>>>
>>>>>> 2018-05-31 11:57:05,568 ERROR
>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error
>>>>>> occurred in the cluster entrypoint.
>>>>>
>>>>> Caused by: java.lang.Exception: Cannot set up the user code libraries:
>>>>> File does not exist:
>>>>> /flink1.5/ha/default/blob/job_5c545fc3f43d69325fb9966b8dd4c8f3/blob_p-5d9f3be555d3b05f90b5e148235d25730eb65b3d-ae486e221962f7b96e36da18fe1c57ca
>>>>> at
>>>>> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:72)
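
[Editor's reference sketch] The setup discussed in this thread (standalone cluster, ZooKeeper-based HA, HDFS for the HA storage and checkpoint directories) is typically configured in `flink-conf.yaml` along the following lines. The hostnames and paths below are placeholders for illustration, not values taken from this thread:

```
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-host-1:2181,zk-host-2:2181
# Directory where JobManager metadata and blobs (job jars) are persisted;
# the "Cannot set up the user code libraries" error above points at this location.
high-availability.storageDir: hdfs:///flink/ha/
# Filesystem state backend with checkpoints on HDFS, matching the
# FsStateBackend snippet quoted earlier in the thread.
state.backend: filesystem
state.checkpoints.dir: hdfs:///flink/checkpoints
```

With this configuration, stopping the cluster via `bin/stop-cluster.sh` should leave the blobs under `high-availability.storageDir` in place, which is why their disappearance is the surprising part of the report.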