I had some ZooKeeper errors that crashed the cluster:

ERROR org.apache.flink.shaded.org.apache.curator.ConnectionState - Authentication failed
What happens to the Flink checkpoints and state if the ZooKeeper cluster crashes? Is it possible that the checkpoint/state is written to ZooKeeper but not to Hadoop, so that when I try to restart the Flink cluster I get the file-not-found error? (A self-contained sketch of the checkpoint setup from this thread is appended after the quoted messages below.)

On Mon, Jun 4, 2018 at 4:27 PM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Miki,
>
> it looks as if you did not submit a job to the cluster of which you shared
> the logs. At least I could not see a submit job call.
>
> Cheers,
> Till
>
> On Mon, Jun 4, 2018 at 12:31 PM miki haiat <miko5...@gmail.com> wrote:
>
>> Hi Till,
>> I've managed to reproduce it.
>> Full log: faild_jm.log
>> <https://gist.githubusercontent.com/miko-code/e634164404354c4c590be84292fd8cb2/raw/baeee310cd50cfa79303b328e3334d960c8e98e6/faild_jm.log>
>>
>> On Mon, Jun 4, 2018 at 10:33 AM Till Rohrmann <trohrm...@apache.org> wrote:
>>
>>> Hmmm, Flink should not delete the stored blobs on the HA storage. Could
>>> you try to reproduce the problem and then send us the logs on DEBUG level?
>>> Please also check, before shutting the cluster down, that the files were
>>> there.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Sun, Jun 3, 2018 at 1:10 PM miki haiat <miko5...@gmail.com> wrote:
>>>
>>>> Hi Till,
>>>>
>>>> 1. The files no longer exist in HDFS.
>>>> 2. Yes, I stopped and started the cluster with the bin commands.
>>>> 3. Unfortunately I deleted the log. :(
>>>>
>>>> I wondered whether this code could cause the issue, i.e. the way I am
>>>> using checkpoints:
>>>>
>>>> StateBackend sb = new FsStateBackend("hdfs://***/flink/my_city/checkpoints");
>>>> env.setStateBackend(sb);
>>>> env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE);
>>>> env.getCheckpointConfig().setCheckpointInterval(60000);
>>>>
>>>> On Fri, Jun 1, 2018 at 6:19 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>
>>>>> Hi Miki,
>>>>>
>>>>> could you check whether the files are really no longer stored on HDFS?
>>>>> How did you terminate the cluster? Simply calling `bin/stop-cluster.sh`? I
>>>>> just tried it locally and it could recover the job after calling
>>>>> `bin/start-cluster.sh` again.
>>>>>
>>>>> What would be helpful are the logs from the initial run of the job. So
>>>>> if you can reproduce the problem, then this log would be very helpful.
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>>
>>>>> On Thu, May 31, 2018 at 6:14 PM, miki haiat <miko5...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm having a weird issue with JobManager recovery.
>>>>>> I am using HDFS and ZooKeeper for an HA standalone cluster.
>>>>>>
>>>>>> I stopped the cluster and changed some parameters in the Flink conf
>>>>>> (memory). But now when I start the cluster again I get an error that
>>>>>> prevents the JobManager from starting: somehow the checkpoint file
>>>>>> does not exist in Hadoop and the JobManager won't start.
>>>>>>
>>>>>> Full JM log file:
>>>>>> <https://gist.github.com/miko-code/28d57b32cb9c4f1aa96fa9873e10e53c>
>>>>>>
>>>>>>> 2018-05-31 11:57:05,568 ERROR
>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error
>>>>>>> occurred in the cluster entrypoint.
>>>>>>> Caused by: java.lang.Exception: Cannot set up the user code
>>>>>>> libraries: File does not exist:
>>>>>>> /flink1.5/ha/default/blob/job_5c545fc3f43d69325fb9966b8dd4c8f3/blob_p-5d9f3be555d3b05f90b5e148235d25730eb65b3d-ae486e221962f7b96e36da18fe1c57ca
>>>>>>> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:72)
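For reference, here is a minimal, self-contained version of the checkpoint setup quoted above. It is only a sketch against the Flink 1.5 DataStream API: the HDFS namenode address, the class name and the job name are placeholders I made up, since the real path was masked as hdfs://*** in the original mail.

import org.apache.flink.runtime.state.StateBackend;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint data is written to HDFS by the FsStateBackend; in HA mode
        // ZooKeeper keeps only references to it. "namenode:8020" is a placeholder
        // for the masked hdfs://*** address from the original mail.
        StateBackend sb = new FsStateBackend("hdfs://namenode:8020/flink/my_city/checkpoints");
        env.setStateBackend(sb);

        // Same settings as in the thread: at-least-once checkpoints every 60 s.
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE);
        env.getCheckpointConfig().setCheckpointInterval(60_000);

        // Dummy pipeline just so the sketch runs end to end.
        env.fromElements(1, 2, 3).print();

        env.execute("checkpoint-setup-sketch");
    }
}

As far as I understand the HA setup, ZooKeeper only stores pointers to the job graph, completed checkpoints and blobs, while the actual files live under high-availability.storageDir on HDFS (judging from the error path above, /flink1.5/ha in this cluster), which is why a blob file missing from HDFS stops the JobManager from recovering. A ZooKeeper outage on its own should not remove anything from HDFS; it only makes the pointers unavailable until ZooKeeper is back.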