Re: Zookeeper connection loss causing checkpoint corruption

2020-09-22 Thread Arpith P
I created a ticket with all my findings: https://issues.apache.org/jira/browse/FLINK-19359.

Thanks,
Arpith

On Tue, Sep 22, 2020 at 12:16 PM Timo Walther wrote:
> Hi Arpith,
>
> is there a JIRA ticket for this issue already? If not, it would be great
> if you can report it. This sounds like a cr

Re: Zookeeper connection loss causing checkpoint corruption

2020-09-21 Thread Timo Walther
Hi Arpith,

is there a JIRA ticket for this issue already? If not, it would be great if you can report it. This sounds like a critical priority issue to me.

Thanks,
Timo

On 22.09.20 06:25, Arpith P wrote:
Hi Peter, I have recently had a similar issue where I could not load from the checkpoi

Re: Zookeeper connection loss causing checkpoint corruption

2020-09-21 Thread Arpith P
Hi Peter, I recently had a similar issue where I could not load from the checkpoints path. I found that whenever a checkpoint is corrupt, the "_metadata" file is not persisted, and I have a program that tracks the checkpoint location based on this strategy and updates a DB with the location
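
For illustration only, the "_metadata" heuristic described in this message could be sketched roughly as below. This is not Arpith's program: the function name `latest_complete_checkpoint` is hypothetical, the sketch assumes the checkpoints live on a locally mounted filesystem (a real cluster would more likely use HDFS or S3), and it assumes the usual `chk-<n>` directory naming under the job's checkpoint directory. It only checks that "_metadata" exists, which mirrors the heuristic above and does not prove the checkpoint can actually be restored.

```python
import re
from pathlib import Path
from typing import Optional

# Checkpoint directories are assumed to be named chk-<n>.
CHK_DIR = re.compile(r"^chk-(\d+)$")


def latest_complete_checkpoint(job_checkpoint_dir: str) -> Optional[Path]:
    """Return the highest-numbered chk-<n> directory that contains a
    "_metadata" file, or None if no such directory exists."""
    root = Path(job_checkpoint_dir)
    if not root.is_dir():
        return None

    candidates = []
    for entry in root.iterdir():
        match = CHK_DIR.match(entry.name)
        # Skip anything that is not a chk-<n> directory with a _metadata file.
        if match and entry.is_dir() and (entry / "_metadata").is_file():
            candidates.append((int(match.group(1)), entry))

    if not candidates:
        return None
    # Compare numerically so that chk-10 ranks above chk-9.
    return max(candidates, key=lambda item: item[0])[1]
```

The returned path is what a tracking job could then record in its database as the last checkpoint that passed this check.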

Zookeeper connection loss causing checkpoint corruption

2020-09-21 Thread Peter Westermann
I recently ran into an issue with our Flink cluster: a ZooKeeper service deployment caused a temporary connection loss and triggered a new JobManager leader election. The leader election was successful and our Flink job restarted from the last checkpoint. This checkpoint appears to have been taken w