Hello,

We have a savepoint that's ~0.5 TiB in size. When we try to restore from
it, the job times out because the restore takes too long (right now
checkpoint timeouts are set to 2 hours, which is already well above where
we want them).

I'm curious whether Flink needs to download the entire savepoint before the
job can continue. Or, for further education, what operations take place
before a job is restored from a savepoint?

Additionally, the network seems to be a big bottleneck. Our network should
be operating in the GiB/s range per instance, but seems to operate at only
70-100 MiB/s when retrieving a savepoint. Are there any constraining
factors in Flink's design that would slow down the network download of a
savepoint (from S3) this much?
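
For a rough sense of scale, here is a back-of-the-envelope sketch of how
long the transfer alone would take at the rates above. It assumes the whole
0.5 TiB moves through a single sustained stream, which is only an upper
bound, since the state is presumably split across task managers. The helper
name transfer_hours is made up for this illustration, and 1024 MiB/s stands
in for "the GiB/s range".

```python
MIB_PER_TIB = 1024 * 1024  # 1 TiB = 1024 GiB = 1,048,576 MiB


def transfer_hours(size_tib, rate_mib_per_s):
    """Hours needed to move size_tib at a sustained rate_mib_per_s.

    Assumption: one sequential stream, no parallelism and no overhead.
    """
    seconds = size_tib * MIB_PER_TIB / rate_mib_per_s
    return seconds / 3600


# Observed rates (70 and 100 MiB/s) versus the expected 1 GiB/s.
for rate in (70, 100, 1024):
    print(f"{rate:>5} MiB/s -> {transfer_hours(0.5, rate):.2f} h")
```

Under those assumptions, 70 MiB/s comes out at just over 2 hours, which by
itself is enough to exceed the current timeout, while 1 GiB/s would be
under 10 minutes.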

Thanks!

-- 

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com <https://www.remind.com/>  |  BLOG <http://blog.remind.com/>  |
FOLLOW US <https://twitter.com/remindhq>  |
LIKE US <https://www.facebook.com/remindhq>
