That's an excellent question. I can't explain that. All I know is this:

- The job was upgraded and resumed from a savepoint.
- After hours of working fine, it failed (as shown in the logs).
- The metadata was cleaned up, again as shown in the logs.
- Because I run this in Kubernetes, the container was restarted immediately,
and because nothing was found in ZooKeeper it started again from the savepoint
(the relevant HA settings are sketched below).
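
For context, the HA setup here is roughly the following flink-conf.yaml sketch;
the quorum addresses, storage path, and cluster id are placeholders, not the
actual values:

  # placeholder addresses and paths
  high-availability: zookeeper
  high-availability.zookeeper.quorum: zk-0:2181,zk-1:2181,zk-2:2181
  high-availability.storageDir: s3://my-bucket/flink/ha
  high-availability.cluster-id: /my-job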

I didn't realize this was happening until a couple of hours later. By that 
point the job had already checkpointed several times, and it was futile to try 
to start it from a retained checkpoint (assuming there were any). 
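
For reference, retained checkpoints are not on by default; they would have had 
to be enabled explicitly in the job, roughly like the sketch below (the class 
name and checkpoint interval are just example values):

import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RetainedCheckpointsExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 seconds (example interval).
        env.enableCheckpointing(60_000);

        // Keep the latest checkpoint when the job is cancelled or fails,
        // instead of cleaning it up together with the job.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // ... build the actual pipeline and call env.execute() here ...
    }
}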

My question is... Is this a bug or not? 

On Mon, Sep 7, 2020, at 1:53 AM, Husky Zeng wrote:
> I mean that checkpoints are usually dropped after the job is terminated by
> the user (unless explicitly configured as retained checkpoints). You could
> use "ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION" to keep your
> checkpoint when the job fails.
> 
> When the ZooKeeper connection was lost, the high-availability system, which
> relies on ZooKeeper, failed as well, and that caused your application to
> stop without retrying.
> 
> I have a question: if your application lost its ZooKeeper connection, how
> did it delete the data in ZooKeeper?
> 
