Re: Does flink support retries on checkpoint write failures

2020-02-03 Thread Till Rohrmann
Glad to hear that you could solve/mitigate the problem and thanks for letting us know.
Cheers, Till
On Sat, Feb 1, 2020 at 2:45 PM Richard Deurwaarder wrote:
> Hi Till & others,
>
> We enabled setFailOnCheckpointingErrors
> (setTolerableCheckpointFailureNumber isn't available in 1.8) and this
>

Re: Does flink support retries on checkpoint write failures

2020-02-01 Thread Richard Deurwaarder
Hi Till & others, We enabled setFailOnCheckpointingErrors (setTolerableCheckpointFailureNumber isn't available in 1.8) and this indeed prevents the large number of restarts. Hopefully a solution for the reported issue[1] with Google gets found, but for now this solved our immediate problem. Thank
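
For readers on a similar setup, the two settings named in this message can be sketched as follows. This is a configuration fragment under assumptions, not the poster's actual code: it assumes the flag was set to `false` (the value that stops a failed checkpoint from failing the task), and the checkpoint interval and the tolerance of 3 are illustrative values only. It needs the Flink streaming dependency on the classpath.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointToleranceSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Illustrative interval (60 s); not taken from the thread.
        env.enableCheckpointing(60_000L);

        // Flink 1.8: do not fail the task when a checkpoint fails.
        // Assumption: this is the value the poster used.
        env.getCheckpointConfig().setFailOnCheckpointingErrors(false);

        // Flink 1.9 and later: tolerate a number of failed checkpoints
        // before failing the job. The value 3 is an arbitrary example.
        // env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
    }
}
```

Either setting only changes how the job reacts to a failed checkpoint; a failed checkpoint is still never used for recovery.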

Re: Does flink support retries on checkpoint write failures

2020-01-30 Thread Arvid Heise
If a checkpoint is not successful, it cannot be used for recovery. That means Flink will restart from the last successful checkpoint and hence not lose any data.
On Wed, Jan 29, 2020 at 9:52 PM wvl wrote:
> Forgive my lack of knowledge here - I'm a bit out of my league here.
>
> But I was wonderin

Re: Does flink support retries on checkpoint write failures

2020-01-29 Thread wvl
Forgive my lack of knowledge here - I'm a bit out of my league here. But I was wondering if allowing e.g. 1 checkpoint to fail and the reason for which somehow caused a record to be lost (e.g. rocksdb exception / taskmanager crash / etc), there would be no Source rewind to the last successful chec

Re: Does flink support retries on checkpoint write failures

2020-01-29 Thread Richard Deurwaarder
Hi Till, I'll see if we can ask Google to comment on those issues; perhaps they have a fix in the works that would solve the root problem. In the meantime, `CheckpointConfig.setTolerableCheckpointFailureNumber` sounds very promising! Thank you for this. I'm going to try this tomorrow to see if tha

Re: Does flink support retries on checkpoint write failures

2020-01-29 Thread Till Rohrmann
Hi Richard, googling a bit indicates that this might actually be a GCS problem [1, 2, 3]. The proposed solution/workaround so far is to retry the whole upload operation as part of the application logic. Since I assume that you are writing to GCS via Hadoop's file system this should actually fall i
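
The "retry the whole upload operation" workaround mentioned here can be illustrated with a minimal, generic sketch. It is not Flink, Hadoop, or GCS client code: `UploadRetry` and `withRetries` are hypothetical names, and the operation passed in stands for whatever performs the complete upload from the start. The sketch reruns the whole operation on an `IOException`, doubling the wait between attempts.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper for illustration; not part of Flink or any GCS library.
final class UploadRetry {

    // Runs the whole operation again from the start whenever it throws an
    // IOException, up to maxAttempts times, doubling the wait between attempts.
    static <T> T withRetries(Callable<T> operation, int maxAttempts,
                             long initialBackoffMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoff = initialBackoffMillis;
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (IOException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoff);
                    backoff *= 2;
                }
            }
        }
        // All attempts failed: surface the last I/O error to the caller.
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated upload that fails twice before succeeding.
        String result = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated transient failure");
            }
            return "uploaded";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Only `IOException` is retried in this sketch; any other exception propagates immediately, on the assumption that it is not a transient storage error.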

Does flink support retries on checkpoint write failures

2020-01-28 Thread Richard Deurwaarder
Hi all, We've got a Flink job running on 1.8.0 which writes its state (RocksDB) to Google Cloud Storage[1]. We've noticed that jobs with a large amount of state (500 GB range) are becoming *very* unstable, restarting on the order of once an hour or even more often. The reason for this instability is tha