Re: Checkpoint metadata deleted by Flink after ZK connection issues

Cristian Tue, 08 Sep 2020 09:29:12 -0700

I'm using the standalone script to start the cluster. 

As far as I can tell, it's not easy to reproduce. We found that ZooKeeper lost 
a node around the time this happened, but our other 75 Flink jobs, which use 
the same setup, version and ZooKeeper, didn't have any issues. They didn't 
even restart.


So unfortunately I don't know how to reproduce this. All I know is I can't 
sleep. I have nightmares where my precious state is deleted. I wake up crying 
and quickly start manually savepointing all jobs just in case, because I feel 
the day of reckoning is near. Flinkpocalypse!

On Tue, Sep 8, 2020, at 5:54 AM, Robert Metzger wrote:
> Thanks a lot for reporting this problem here Cristian!
> 
> I am not super familiar with the involved components, but the behavior you 
> are describing doesn't sound right to me.
> Which entrypoint are you using? This is logged at the beginning, like this: 
> "2020-09-08 14:45:32,807 INFO  
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] -  Starting 
> StandaloneSessionClusterEntrypoint (Version: 1.11.1, Scala: 2.12, 
> Rev:7eb514a, Date:2020-07-15T07:02:09+02:00)"
> 
> Do you know by chance if this problem is reproducible? With the 
> StandaloneSessionClusterEntrypoint I was not able to reproduce the problem.
> 
> 
> 
> 
> On Tue, Sep 8, 2020 at 4:00 AM Husky Zeng <568793...@qq.com> wrote:
>> Hi Cristian,
>> 
>> 
>> I don't know if it was designed to be like this deliberately.
>> 
>> So I have already submitted an issue and am waiting for somebody to respond.
>> 
>> https://issues.apache.org/jira/browse/FLINK-19154   
>> 
>> 
>> 
>> --
>> Sent from: 
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
