Glad to hear that you could solve/mitigate the problem and thanks for
letting us know.
Cheers,
Till
On Sat, Feb 1, 2020 at 2:45 PM Richard Deurwaarder wrote:
Hi Till & others,
We enabled setFailOnCheckpointingErrors
(setTolerableCheckpointFailureNumber isn't available in 1.8) and this
indeed prevents the large number of restarts.
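For reference, a minimal sketch of this setting on the Flink 1.8 API (the checkpoint interval and class structure are illustrative; only the setFailOnCheckpointingErrors call is the point):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointFailureConfig {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoint every 60s (illustrative value).
            env.enableCheckpointing(60_000);

            // Flink 1.8: do not fail the task (and hence restart the job) when a
            // checkpoint fails; the job keeps running and tries again on the next trigger.
            env.getCheckpointConfig().setFailOnCheckpointingErrors(false);

            // ... build and execute the actual pipeline here ...
        }
    }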
Hopefully a solution for the reported issue[1] with Google gets found, but
for now this solved our immediate problem.
Thanks!
If a checkpoint is not successful, it cannot be used for recovery.
That means Flink will restart from the last successful checkpoint and hence
not lose any data.
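Put differently, recovery always rewinds to the last completed checkpoint. A rough sketch of the configuration this relies on (all values are illustrative):

    import java.util.concurrent.TimeUnit;

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RecoverySketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Exactly-once checkpoints: on failure the job rewinds sources and state
            // to the last *completed* checkpoint, so no data is lost.
            env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

            // How the job restarts after a failure (illustrative values).
            env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));
        }
    }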
On Wed, Jan 29, 2020 at 9:52 PM wvl wrote:
Forgive my lack of knowledge - I'm a bit out of my league here.
But I was wondering: if we allow e.g. 1 checkpoint to fail, and the reason
for the failure somehow caused a record to be lost (e.g. a RocksDB exception /
taskmanager crash / etc.), would there be no Source rewind to the last
successful checkpoint, and hence data loss?
Hi Till,
I'll see if we can ask Google to comment on those issues; perhaps they have
a fix in the works that would solve the root problem.
In the meantime,
`CheckpointConfig.setTolerableCheckpointFailureNumber` sounds very
promising!
Thank you for this. I'm going to try this tomorrow to see if that helps.
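For completeness, on Flink versions where it exists the newer knob looks roughly like this (the tolerated number is an illustrative value):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class TolerableFailuresSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(60_000);

            // Tolerate up to N checkpoint failures before failing the job
            // (not available in 1.8, as it turned out; N = 3 is illustrative).
            env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
        }
    }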
Hi Richard,
Googling a bit indicates that this might actually be a GCS problem [1, 2,
3]. The proposed solution/workaround so far is to retry the whole upload
operation as part of the application logic. Since I assume that you are
writing to GCS via Hadoop's file system, this should actually fall into the
Hadoop file system / GCS connector layer rather than your application code.
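A generic, application-level retry around a failing upload could look roughly like the sketch below; the helper, the backoff values, and the uploadToGcs call in the usage comment are purely illustrative and not tied to any particular GCS API:

    import java.util.concurrent.Callable;

    public final class Retry {

        // Runs the given action, retrying with exponential backoff when it throws.
        // maxAttempts and the backoff values are placeholders.
        public static <T> T withRetries(Callable<T> action, int maxAttempts) throws Exception {
            if (maxAttempts < 1) {
                throw new IllegalArgumentException("maxAttempts must be >= 1");
            }
            long backoffMillis = 500;
            Exception last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return action.call();
                } catch (Exception e) {
                    last = e;
                    if (attempt < maxAttempts) {
                        Thread.sleep(backoffMillis);
                        backoffMillis *= 2; // exponential backoff
                    }
                }
            }
            throw last;
        }
    }

    // Usage (uploadToGcs is a hypothetical method):
    //   Retry.withRetries(() -> uploadToGcs(data), 5);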
Hi all,
We've got a Flink job running on 1.8.0 which writes its state (RocksDB) to
Google Cloud Storage[1]. We've noticed that jobs with a large amount of
state (in the 500 GB range) are becoming *very* unstable, on the order of
restarting once an hour or even more often.
The reason for this instability is that checkpoints to GCS occasionally fail, and each failed checkpoint causes the whole job to restart.
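For context, the kind of setup described here is roughly the following sketch (bucket path and the incremental flag are illustrative; the gs:// scheme assumes the Hadoop GCS connector is on the classpath):

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class GcsRocksDbSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // RocksDB state backend with checkpoints written to a GCS bucket
            // ('true' enables incremental checkpoints; path is a placeholder).
            env.setStateBackend(new RocksDBStateBackend("gs://my-bucket/flink-checkpoints", true));

            // Checkpoint every 60s (illustrative value).
            env.enableCheckpointing(60_000);
        }
    }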