Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-29 Thread Till Rohrmann
Great, thanks Klou! Cheers, Till

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-28 Thread Kostas Kloudas
Hi all, I will have a look. Kostas

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-28 Thread Till Rohrmann
Hi Cristian, thanks for reporting this issue. It does indeed look like a very critical problem. The problem seems to be that the ApplicationDispatcherBootstrap class produces an exception (that the requested job can no longer be found because of a lost ZooKeeper connection) which will be interpreted as

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-09 Thread Yang Wang
> The job sub directory will be cleaned up when the job finished/canceled/failed. Since we can submit multiple jobs to a Flink session, what I mean is that when a job reaches a terminal state, the sub node (e.g. /flink/application_/running_job_registry/4d255397c7aeb5327adb567238c983c1) on th

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-08 Thread Cristian
> The job sub directory will be cleaned up when the job > finished/canceled/failed. What does this mean? Also, to clarify: I'm a very sloppy developer. My jobs crash ALL the time... and yet, the jobs would ALWAYS resume from the last checkpoint. The only cases where I expect Flink to clean u

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-08 Thread Yang Wang
AFAIK, the HA data, including the ZooKeeper metadata and the real data on DFS, will only be cleaned up when the Flink cluster reaches a terminated state. So if you are using a session cluster, the root cluster node on ZK will be cleaned up after you manually stop the session cluster. The job sub directory

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-08 Thread Cristian
I'm using the standalone script to start the cluster. As far as I can tell, it's not easy to reproduce. We found that ZooKeeper lost a node around the time this happened, but none of our other 75 Flink jobs, which use the same setup, version and ZooKeeper, had any issues. They didn't eve

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-08 Thread Robert Metzger
Thanks a lot for reporting this problem here Cristian! I am not super familiar with the involved components, but the behavior you are describing doesn't sound right to me. Which entrypoint are you using? This is logged at the beginning, like this: "2020-09-08 14:45:32,807 INFO org.apache.flink.ru

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-07 Thread Husky Zeng
Hi Cristian, I don't know whether it was deliberately designed to behave like this, so I have submitted an issue and am waiting for somebody to respond: https://issues.apache.org/jira/browse/FLINK-19154

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-07 Thread Cristian
That's an excellent question. I can't explain that. All I know is this: - the job was upgraded and resumed from a savepoint - after hours of working fine, it failed (as shown in the logs) - the metadata was cleaned up, again as shown in the logs - because I run this in Kubernetes, the conta

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-07 Thread Husky Zeng
I mean that checkpoints are usually dropped after the job is terminated by the user (unless they are explicitly configured as retained checkpoints). You could use "ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION" to save your checkpoint when it comes to a failure. When your ZooKeeper lost connect

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-05 Thread Cristian
> If you want to save your checkPoint,you could refer to this document What do you mean? We already persist our savepoints, and we do not delete them explicitly ever. The problem is that Flink deleted the data from zookeeper when it shouldn't have. Is it possible to start a job from a checkpo

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-05 Thread Husky Zeng
Hi Cristian, From this code, we can see that the Exception or Error was ignored in dispatcher.shutDownCluster(applicationStatus). `` org.apache.flink.runtime.dispatcher.DispatcherGateway#shutDownCluster return applicationCompletionFuture .handle((r, t) -> {
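The snippet above is cut off, so the following is only a minimal, standalone sketch of the general pattern being discussed: a `CompletableFuture.handle` callback that never looks at its throwable turns a failed future into a normally completed one. It is not Flink's code. The class name `SwallowedFailureDemo`, the `Status` enum, and both helper methods are hypothetical and exist only for illustration.

```java
import java.util.concurrent.CompletableFuture;

public class SwallowedFailureDemo {

    // Hypothetical stand-in for a terminal application status.
    enum Status { SUCCEEDED, UNKNOWN }

    // The callback never inspects the throwable, so a failed future
    // is reported exactly like a successful one.
    static CompletableFuture<Status> ignoringError(CompletableFuture<Void> jobFuture) {
        return jobFuture.handle((result, throwable) -> Status.SUCCEEDED);
    }

    // The callback inspects the throwable and reports a distinct status
    // when the job future completed exceptionally.
    static CompletableFuture<Status> inspectingError(CompletableFuture<Void> jobFuture) {
        return jobFuture.handle((result, throwable) ->
                throwable == null ? Status.SUCCEEDED : Status.UNKNOWN);
    }

    public static void main(String[] args) {
        // Simulate a job future that fails, e.g. after a lost connection.
        CompletableFuture<Void> failedJob = new CompletableFuture<>();
        failedJob.completeExceptionally(
                new IllegalStateException("job not found (simulated connection loss)"));

        System.out.println(ignoringError(failedJob).join());   // prints SUCCEEDED
        System.out.println(inspectingError(failedJob).join()); // prints UNKNOWN
    }
}
```

In this sketch, any cleanup that is triggered only on a "successful" terminal status would also run for the failed future in the first variant, which is the kind of behaviour the thread is worried about.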

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-04 Thread Cristian
My suspicion is that somewhere in the path where it fails to connect to ZooKeeper, the exception is swallowed, so instead of running the shutdown path for when the job fails, the general shutdown path is taken. Fortunately, this was a job for which we had a savepoint from yesterday. Otherwise

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-04 Thread Qingdong Zeng
Hi Cristian, In the log, we can see that it went into the method shutDownAsync(applicationStatus,null,true); `` 2020-09-04 17:32:07,950 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint[] - Shutting StandaloneApplicationClusterEntryPoint down w

Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-04 Thread Cristian
Hello guys. We run a stand-alone cluster that runs a single job (if you are familiar with the way Ververica Platform runs Flink jobs, we use a very similar approach). It runs Flink 1.11.1 straight from the official docker image. Usually, when our jobs crash for any reason, they will resume from