Hello,

We have a savepoint that's ~0.5 TiB in size. When we try to restore from it, we time out because it takes too long (right now checkpoint timeouts are set to 2 hours, which is already well above where we want them).
I'm curious whether the job needs to download the entire savepoint before it can continue. Or, for further education, what are all the operations that take place before a job is restored from a savepoint?

Additionally, the network seems to be a big bottleneck. Our network should be operating in the GiB/s range per instance, but it seems to operate at only 70-100 MiB/s when retrieving a savepoint. Are there any constraining factors in Flink's design that would slow down the network download of a savepoint this much (from S3)?

Thanks!

--
Rex Fenley | Software Engineer - Mobile and Backend
Remind.com <https://www.remind.com/> | BLOG <http://blog.remind.com/> | FOLLOW US <https://twitter.com/remindhq> | LIKE US <https://www.facebook.com/remindhq>