Hi all,

We are trying to benchmark savepoint size vs. restore time.

One thing we've observed is that when we reduce the number of task
managers, the time to restore from a savepoint increases drastically:

1/ Restoring from a 9.7 TB savepoint onto 156 task managers takes 28 minutes
2/ Restoring from the same savepoint onto 30 task managers takes over 3
hours

*Is this expected? How does the restore process work? Is this just a matter
of having lower restore parallelism for 30 task managers vs 156 task
managers?*
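
As a rough sanity check on the two runs above, here is a back-of-the-envelope
sketch of the average restore rate each task manager would need to sustain.
It assumes decimal terabytes, state spread evenly across task managers, and
treats "over 3 hours" as exactly 3 hours (so the figure for 2/ is an upper
bound). The function name is only for illustration.

```python
TB = 10**12  # assumption: decimal terabytes


def per_tm_throughput_mb_s(savepoint_bytes, task_managers, restore_seconds):
    """Average restore rate per task manager in MB/s, assuming even state distribution."""
    per_tm_bytes = savepoint_bytes / task_managers
    return per_tm_bytes / restore_seconds / 1e6


savepoint = 9.7 * TB

# 1/ 156 task managers, 28 minutes
case_1 = per_tm_throughput_mb_s(savepoint, 156, 28 * 60)

# 2/ 30 task managers, "over 3 hours" taken as 3 hours, so this is an upper bound
case_2 = per_tm_throughput_mb_s(savepoint, 30, 3 * 60 * 60)

print(f"1/ ~{case_1:.0f} MB/s per task manager")
print(f"2/ at most ~{case_2:.0f} MB/s per task manager")
```

Under these assumptions the per-task-manager rate is roughly 37 MB/s in 1/
and at most about 30 MB/s in 2/, so most of the slowdown would be accounted
for by each task manager having to restore about 5x more data.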

Some details:

- Running on Kubernetes
- Using RocksDB as the state backend, with its files on a local SSD
- Savepoint is hosted on GCS
- The smaller task manager case is important to us because we expect to
deploy our application with a high number of task managers, and downscale
once a backfill is completed

Differences between 1/ and 2/:

2/ has decreased task manager count 156 -> 30
2/ has decreased operator parallelism by a factor of ~10
2/ uses a striped SSD (3 SSDs mounted as a single logical volume) to hold
RocksDB files

Thanks in advance for your help!
