I had some ZooKeeper errors that crashed the cluster:

 ERROR org.apache.flink.shaded.org.apache.curator.ConnectionState - Authentication failed

What happens to the Flink checkpoints and state if the ZooKeeper cluster crashes?
Is it possible that the checkpoint/state is written to ZooKeeper but not
to Hadoop, and that when I then try to restart the Flink cluster I get the
file-not-found error?


On Mon, Jun 4, 2018 at 4:27 PM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Miki,
>
> it looks as if you did not submit a job to the cluster of which you shared
> the logs. At least I could not see a submit job call.
>
> Cheers,
> Till
>
> On Mon, Jun 4, 2018 at 12:31 PM miki haiat <miko5...@gmail.com> wrote:
>
>> Hi Till,
>> I've managed to reproduce it.
>> Full log: faild_jm.log
>> <https://gist.githubusercontent.com/miko-code/e634164404354c4c590be84292fd8cb2/raw/baeee310cd50cfa79303b328e3334d960c8e98e6/faild_jm.log>
>>
>> On Mon, Jun 4, 2018 at 10:33 AM Till Rohrmann <trohrm...@apache.org>
>> wrote:
>>
>>> Hmmm, Flink should not delete the stored blobs on the HA storage. Could
>>> you try to reproduce the problem and then send us the logs on DEBUG level?
>>> Please also check, before shutting the cluster down, that the files were
>>> there.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Sun, Jun 3, 2018 at 1:10 PM miki haiat <miko5...@gmail.com> wrote:
>>>
>>>> Hi Till,
>>>>
>>>>    1. The files no longer exist in HDFS.
>>>>    2. Yes, I stopped and started the cluster with the bin commands.
>>>>    3. Unfortunately I deleted the log.. :(
>>>>
>>>>
>>>> I wondered if this code could cause the issue, i.e. the way I am using
>>>> checkpoints:
>>>>
>>>> StateBackend sb = new FsStateBackend("hdfs://***/flink/my_city/checkpoints");
>>>> env.setStateBackend(sb);
>>>> env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE);
>>>> env.getCheckpointConfig().setCheckpointInterval(60000);
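>>>>
>>>> (For reference, the imports behind that snippet are, I believe:
>>>>
>>>> import org.apache.flink.runtime.state.StateBackend;
>>>> import org.apache.flink.runtime.state.filesystem.FsStateBackend;
>>>> import org.apache.flink.streaming.api.CheckpointingMode;
>>>>
>>>> and env is the usual StreamExecutionEnvironment obtained via
>>>> StreamExecutionEnvironment.getExecutionEnvironment().)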
>>>>
>>>> On Fri, Jun 1, 2018 at 6:19 PM Till Rohrmann <trohrm...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi Miki,
>>>>>
>>>>> could you check whether the files are really no longer stored on HDFS?
>>>>> How did you terminate the cluster? Simply calling `bin/stop-cluster.sh`? I
>>>>> just tried it locally and it could recover the job after calling
>>>>> `bin/start-cluster.sh` again.
>>>>>
>>>>> What would be helpful are the logs from the initial run of the job. So
>>>>> if you can reproduce the problem, then this log would be very helpful.
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>>
>>>>> On Thu, May 31, 2018 at 6:14 PM, miki haiat <miko5...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm having a weird issue with JM recovery.
>>>>>> I'm using HDFS and ZooKeeper for an HA standalone cluster.
>>>>>>
>>>>>> I stopped the cluster and changed some parameters in the Flink conf
>>>>>> (memory).
>>>>>> But now when I start the cluster again I get an error that
>>>>>> prevents the JM from starting:
>>>>>> somehow the checkpoint file doesn't exist in Hadoop and the JM won't
>>>>>> start.
>>>>>>
>>>>>> Full JM log file:
>>>>>> <https://gist.github.com/miko-code/28d57b32cb9c4f1aa96fa9873e10e53c>
>>>>>>
>>>>>>
>>>>>>> 2018-05-31 11:57:05,568 ERROR
>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error
>>>>>>> occurred in the cluster entrypoint.
>>>>>>
>>>>>> Caused by: java.lang.Exception: Cannot set up the user code
>>>>>> libraries: File does not exist:
>>>>>> /flink1.5/ha/default/blob/job_5c545fc3f43d69325fb9966b8dd4c8f3/blob_p-5d9f3be555d3b05f90b5e148235d25730eb65b3d-ae486e221962f7b96e36da18fe1c57ca
>>>>>> at
>>>>>> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:72)
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
