That's an excellent question. I can't explain that. All I know is this:

- the job was upgraded and resumed from a savepoint
- after hours of working fine, it failed (as the logs show)
- the metadata was cleaned up, again as shown in the logs
- because I run this in Kubernetes, the container was restarted immediately, and since nothing was found in ZooKeeper it started again from the savepoint
I didn't realize this was happening until a couple of hours later. At that point the job had already checkpointed several times, and it was futile to try to start it from a retained checkpoint (assuming there were any).

My question is... Is this a bug or not?

On Mon, Sep 7, 2020, at 1:53 AM, Husky Zeng wrote:
> I mean that checkpoints are usually dropped after the job is terminated by
> the user (unless they are explicitly configured as retained checkpoints).
> You could use "ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION" to
> keep your checkpoint when a failure occurs.
>
> When your ZooKeeper connection was lost, the high-availability system,
> which relies on ZooKeeper, also failed, and that led to your application
> stopping without a retry.
>
> I have a question: if your application lost the ZooKeeper connection, how
> did it delete the data in ZooKeeper?
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
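For reference, the retained-checkpoint setting Husky Zeng mentions would be enabled from the job code roughly like this. This is only a minimal sketch against the DataStream API of the Flink releases current at the time; the class name, job name, checkpoint interval, and placeholder pipeline are all illustrative, and it assumes state.checkpoints.dir is set in flink-conf.yaml so the externalized checkpoint metadata is actually written somewhere durable:

    import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RetainedCheckpointsSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Take a checkpoint every 60 seconds (interval chosen only for illustration).
            env.enableCheckpointing(60_000);

            // Keep the latest externalized checkpoint when the job is cancelled or fails,
            // instead of deleting it, so it can serve as a manual restore point.
            env.getCheckpointConfig()
               .enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

            // Placeholder pipeline just so the sketch runs end to end.
            env.fromElements(1, 2, 3).print();
            env.execute("retained-checkpoints-sketch");
        }
    }

Even with this enabled, as far as I understand it the retained checkpoint is not picked up automatically on a fresh submission; you would have to point the new run at its path explicitly (e.g. with the -s option of "flink run"), which is why losing track of which checkpoints survived makes recovery awkward.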