After looking through a lot of graphs and our AWS limits, I've come to the
conclusion that we're hitting the limits on our disk writes. I'm guessing this
is backpressuring the entire restore process. I'm still very curious about all
the steps involved in savepoint restoration, though!
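
For anyone hitting this later: the knobs I'm planning to poke at are the
checkpoint timeout we already raised and the RocksDB transfer thread count.
This is just a sketch from my reading of the Flink docs (the thread count of 8
is an arbitrary guess, and as far as I can tell the transfer threads only apply
to incremental checkpoint transfers, so they may not help a full savepoint
restore at all):

    # in flink-conf.yaml
    # how long a checkpoint may run before it is aborted (we're at 2 hours)
    execution.checkpointing.timeout: 2 h

    # threads used to transfer RocksDB state files to/from remote storage
    state.backend.rocksdb.checkpoint.transfer.thread.num: 8

If EBS write throughput really is the ceiling, more download threads shouldn't
change much, which would at least rule the network in or out.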

On Fri, Jan 15, 2021 at 7:50 PM Rex Fenley <r...@remind101.com> wrote:

> Hello,
>
> We have a savepoint that's ~0.5 TiB in size. When we try to restore from
> it, we time out because the restore takes too long (right now our checkpoint
> timeout is set to 2 hours, which is already way above where we want it).
>
> I'm curious whether the job needs to download the entire savepoint before it
> can continue. Or, for my own education, what are all the operations that take
> place before a job is restored from a savepoint?
>
> Additionally, the network seems to be a big bottleneck. Our network should
> be capable of GiB/s per instance, but we only see 70-100 MiB/s when
> retrieving a savepoint. Are there any constraining factors in Flink's design
> that would slow down the download of a savepoint from S3 this much?
>
> Thanks!
>


-- 

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com <https://www.remind.com/>  |  BLOG <http://blog.remind.com/>  |
FOLLOW US <https://twitter.com/remindhq>  |  LIKE US
<https://www.facebook.com/remindhq>
