Reducing Task Manager Count Greatly Increases Savepoint Restore

Kevin Lam Wed, 07 Apr 2021 07:06:54 -0700

Hi all,

We are trying to benchmark savepoint size vs. restore time.


One thing we've observed is that when we reduce the number of task
managers, the time to restore from a savepoint increases drastically:

1/ Restoring from a 9.7 TB savepoint onto 156 task managers takes 28 minutes
2/ Restoring from the same savepoint onto 30 task managers takes over 3
hours
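
A rough sketch of the arithmetic behind these two numbers. It assumes
1 TB = 1e12 bytes, that state is spread evenly across task managers, and
that run 2/ took exactly 3 hours (it took longer, so its throughput is an
upper bound). The function name is only illustrative.

```python
# Back-of-the-envelope restore throughput for the two runs above.
# Assumptions: decimal terabytes, state evenly distributed across TMs.
SAVEPOINT_BYTES = 9.7e12


def restore_throughput(task_managers, minutes):
    """Return (aggregate MB/s, per-task-manager MB/s) for one restore run."""
    aggregate = SAVEPOINT_BYTES / (minutes * 60) / 1e6
    return aggregate, aggregate / task_managers


runs = {
    "1/ 156 TMs, 28 min": (156, 28),
    "2/ 30 TMs, 180 min (lower bound on time)": (30, 180),
}
for label, (tms, minutes) in runs.items():
    aggregate, per_tm = restore_throughput(tms, minutes)
    print(f"{label}: {aggregate:,.0f} MB/s total, {per_tm:.1f} MB/s per TM")
```

Under these assumptions, each task manager restores at roughly 37 MB/s in
1/ and at most about 30 MB/s in 2/. Per-task-manager throughput is thus
similar in both runs, while aggregate throughput drops with the task
manager count.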

*Is this expected? How does the restore process work? Is this just a matter
of having lower restore parallelism for 30 task managers vs 156 task
managers?*

Some details:

- Running on Kubernetes
- Using the RocksDB state backend, with a local SSD for its files
- Savepoint is hosted on GCS
- The smaller task manager case is important to us because we expect to
deploy our application with a high number of task managers, and downscale
once a backfill is completed

Differences between 1/ and 2/:

2/ has decreased task manager count 156 -> 30
2/ has decreased operator parallelism by a factor of ~10
2/ uses a striped SSD volume (3 SSDs mounted as a single logical volume) to
hold RocksDB files

Thanks in advance for your help!
