Re: Flink Taskmanager failure recovery and large state

2021-04-07 Thread Yaroslav Tkachenko
Hi Dhanesh, Thanks for the recommendation! I'll try it out.

On Wed, Apr 7, 2021 at 1:59 AM dhanesh arole wrote:
> Hi Yaroslav,
>
> We faced similar issues in our large stateful stream processing job. I had
> asked question

Re: Flink Taskmanager failure recovery and large state

2021-04-07 Thread dhanesh arole
Hi Yaroslav, We faced similar issues in our large stateful stream processing job. I had asked a question about it on the user mailing list a few days back. Based on the reply to

Re: Flink Taskmanager failure recovery and large state

2021-04-06 Thread Robert Metzger
Hey Yaroslav, GCS is a somewhat popular filesystem that should work fine with Flink. It seems that the initial scale of a bucket is 5000 read requests per second (https://cloud.google.com/storage/docs/request-rate), and your job should be at roughly the same rate (depending on how fast your job resta

Re: Flink Taskmanager failure recovery and large state

2021-04-01 Thread Guowei Ma
Hi Yaroslav, AFAIK Flink does not retry if downloading the checkpoint from storage fails. On the other hand, the FileSystem already has this retry mechanism, so I think there is no need for Flink to retry. I am not very sure, but from the log it seems that the gfs's retry is interrupted by
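
For readers following the thread, here is a minimal sketch of the general pattern being discussed: a retry loop with backoff that gives up as soon as the calling thread is interrupted while it waits. This is not the GCS connector's or Flink's actual code. The class `RetrySketch`, the helper `retryWithBackoff`, and its parameters are hypothetical names chosen only for illustration.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetrySketch {

    /**
     * Hypothetical helper (not part of Flink or the GCS connector).
     * Retries the action on IOException, doubling the wait between attempts.
     * If the thread is interrupted while waiting, it stops retrying at once.
     */
    static <T> T retryWithBackoff(Callable<T> action, int maxAttempts, long initialBackoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long backoff = initialBackoffMillis;
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (IOException e) {
                last = e; // remember the failure and maybe retry
            }
            if (attempt == maxAttempts) {
                break; // no attempts left, so do not sleep again
            }
            try {
                Thread.sleep(backoff);
            } catch (InterruptedException ie) {
                // Restore the interrupt flag and abandon the remaining retries.
                Thread.currentThread().interrupt();
                IOException giveUp =
                        new IOException("Interrupted while sleeping before retry; giving up", ie);
                giveUp.addSuppressed(last);
                throw giveUp;
            }
            backoff *= 2;
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated read that fails twice with a transient error, then succeeds.
        String result = retryWithBackoff(() -> {
            if (++calls[0] < 3) {
                throw new IOException("transient failure " + calls[0]);
            }
            return "ok";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Under this pattern, a task cancellation that interrupts the thread during the backoff sleep ends the retries early, even though attempts remain. That would be consistent with the kind of log message discussed in this thread, but whether it is what actually happened in this job is an assumption.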

Re: Flink Taskmanager failure recovery and large state

2021-04-01 Thread Yaroslav Tkachenko
Hi Guowei, I thought Flink could support any HDFS-compatible object store, like the majority of Big Data frameworks do. So we just added the "flink-shaded-hadoop-2-uber" and "gcs-connector-latest-hadoop2" dependencies to the classpath. After that, using the "gs" prefix seems to be possible: state.checkpoints.di

Re: Flink Taskmanager failure recovery and large state

2021-04-01 Thread Guowei Ma
Hi Yaroslav, AFAIK there is no official GCS FileSystem support in Flink. Did you implement the GCS file system yourself? Would you like to share the whole JM log? BTW: From the following log, I think the implementation already has some retry mechanism. >>> Interrupted while sleeping before retry. Giv