Failed job reinitiated with wrong checkpoint after a ZK reconnection

Paul Lam Thu, 22 Oct 2020 23:41:29 -0700

Hi,

We have a Flink 1.11.0 job running on YARN that reached the FAILED state because
its JobManager lost leadership during a ZK full GC. After the ZK connection
recovered, however, the job was somehow reinitiated with no checkpoints found in
ZK, so it restored from an earlier savepoint, which rewound the job unexpectedly.


I’ve filed an issue [1]; any comments are appreciated.

1. https://issues.apache.org/jira/browse/FLINK-19778

Best,
Paul Lam
