Hello!

Recently we ran into an issue when checkpointing to S3. Because S3
rate-limits requests per prefix, the /shared directory would get slammed
and trigger S3 throttling. There is no easy way around this, because
/job/checkpoint/:id/shared all falls under a single prefix, which is
limited to 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per
second per prefix.

(source:
https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html)

Jobs sometimes also crash outright, and they leave state lying around
when we bring the job up fresh.

Our mitigations so far have been to 1) reduce the number of
taskmanagers, 2) set state.backend.rocksdb.checkpoint.transfer.thread.num
back to 1 (we had increased it to speed up checkpoints/savepoints), and
3) manually delete large numbers of old files from the shared directory.
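
For concreteness, mitigation 2) is a one-line change in flink-conf.yaml
(1 is the option's default value):

```yaml
# flink-conf.yaml
# Revert checkpoint/savepoint transfer parallelism to the default to
# reduce the burst of concurrent S3 requests against the shared prefix.
state.backend.rocksdb.checkpoint.transfer.thread.num: 1
```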

My question:
Can we safely apply a Lifecycle Policy to the directory/bucket to remove
things? How long are objects under /shared retained? Is it only for the
duration of the oldest retained checkpoint, or could a file carry
forward, untouched, from the very first checkpoint to the very last?
This shared checkpoint dir/prefix is currently limiting the scalability
of our jobs. I don't believe the _entropy_ trick would help here,
because the issue is ultimately that there's a single shared directory.
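
For reference, by the _entropy_ trick I mean the entropy-injection
feature of Flink's S3 filesystem plugins, configured roughly like this
(the bucket name and entropy length here are just illustrative):

```yaml
# flink-conf.yaml -- entropy injection for the S3 filesystem plugins.
# The _entropy_ placeholder in the configured path is replaced with
# random characters when checkpoint data files are written, spreading
# writes across S3 prefixes.
s3.entropy.key: _entropy_
s3.entropy.length: 4
state.checkpoints.dir: s3://our-bucket/_entropy_/checkpoints
```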

Thank you!
Trystan
