Hello! Recently we ran into an issue when checkpointing to S3. Because S3 rate limits requests per prefix, the /shared directory gets slammed and triggers S3 throttling. We haven't found a workaround, because /job/checkpoint/:id/shared all falls under a single prefix, and a prefix is limited to 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second (source: https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html).
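To illustrate why a single prefix saturates, here's a back-of-the-envelope sketch; every number in it is a hypothetical placeholder, not our actual cluster:

# All figures below are made-up placeholders, purely illustrative.
task_managers = 300       # parallel uploaders during a checkpoint
transfer_threads = 4      # state.backend.rocksdb.checkpoint.transfer.thread.num
puts_per_thread_sec = 5   # rough PUT rate each upload thread sustains

# Every upload lands under the same .../shared/ prefix, so the rates add up.
aggregate_puts = task_managers * transfer_threads * puts_per_thread_sec
print(f"~{aggregate_puts} PUT/s against one prefix (S3 limit: 3,500)")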
Jobs sometimes also crash outright, and they leave state lying around when we bring the job up fresh. Our mitigations so far have been to:

1) reduce the number of TaskManagers,
2) set state.backend.rocksdb.checkpoint.transfer.thread.num back to 1 (we had increased it to speed up checkpointing/savepointing), and
3) manually delete tons of old files in the shared directory.

My questions: Can we safely apply a Lifecycle Policy to the directory/bucket to remove things? How long are objects under /shared retained? Is it only for the duration of the oldest retained checkpoint, or could a file carry forward, untouched, from the very first checkpoint to the very last? This shared checkpoint dir/prefix is currently limiting the scalability of our jobs. I don't believe the _entropy_ trick would help here, because the issue is ultimately that there's a single shared directory.
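To make the Lifecycle Policy question concrete, here is a sketch of the kind of rule we're contemplating (via boto3; the bucket name, prefix, and retention window are placeholders, and whether the expiration part is safe is exactly what we're asking):

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="our-checkpoint-bucket",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-shared-checkpoint-state",
                # Target only the checkpoint tree, not unrelated data.
                "Filter": {"Prefix": "job/checkpoint/"},
                "Status": "Enabled",
                # Expire objects after N days -- this is the part we are
                # unsure about, since an old object under /shared may
                # still be referenced by a live incremental checkpoint.
                "Expiration": {"Days": 7},
            }
        ]
    },
)

Thank you! Trystan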