Hi all,

We are trying to benchmark savepoint size vs. restore time.

One thing we've observed is that when we reduce the number of task managers, the time to restore from a savepoint increases drastically:

1/ Restoring from a 9.7 TB savepoint onto 156 task managers takes 28 minutes
2/ Restoring from the same savepoint onto 30 task managers takes over 3 hours

*Is this expected? How does the restore process work? Is this just a matter of having lower restore parallelism for 30 task managers vs 156 task managers?*

Some details:
- Running on Kubernetes
- Using RocksDB with a local SSD for the state backend
- Savepoint is hosted on GCS
- The smaller task manager case is important to us because we expect to deploy our application with a high number of task managers, and downscale once a backfill is completed

Differences between 1/ and 2/:
2/ has decreased task manager count 156 -> 30
2/ has decreased operator parallelism by a factor of ~10
2/ uses a striped SSD (3 SSDs mounted as a single logical volume) to hold RocksDB files

Thanks in advance for your help!
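
P.S. For reference, here is a rough back-of-envelope sketch of the average restore throughput per task manager in the two cases. It assumes 9.7 TB means decimal terabytes, that state is spread evenly across task managers, and it treats "over 3 hours" as exactly 3 hours, so the figure for 2/ is an upper bound. The function name is just for illustration.

```python
def per_tm_throughput_mb_s(savepoint_bytes, task_managers, restore_seconds):
    """Average state restored per task manager per second, in decimal MB/s."""
    bytes_per_tm = savepoint_bytes / task_managers  # assumes an even split of state
    return bytes_per_tm / restore_seconds / 1e6


SAVEPOINT_BYTES = 9.7e12  # 9.7 TB, assuming decimal terabytes

# 1/ 156 task managers, 28 minutes
case1 = per_tm_throughput_mb_s(SAVEPOINT_BYTES, 156, 28 * 60)

# 2/ 30 task managers, "over 3 hours" (3 hours used here, so this is an upper bound)
case2 = per_tm_throughput_mb_s(SAVEPOINT_BYTES, 30, 3 * 60 * 60)

print(f"1/ ~{case1:.1f} MB/s per task manager")
print(f"2/ at most ~{case2:.1f} MB/s per task manager")
```

Under these assumptions this works out to roughly 37 MB/s per task manager in 1/ and at most about 30 MB/s in 2/, so each task manager in 2/ has about 5x more state to load at a similar or lower rate. The numbers ignore the parallelism and disk layout differences listed above.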