Hi,

For the rate limit, could you please try entropy injection [1]? As for checkpoints, Flink handles the file lifecycle itself: it deletes a file once that file can never be used again. A file under the checkpoint directory remains only as long as a corresponding checkpoint is still valid.
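For reference, entropy injection is configured roughly like this in flink-conf.yaml (the bucket and path below are placeholders; see [1] for the authoritative description):

    s3.entropy.key: _entropy_
    s3.entropy.length: 4
    # the checkpoint path must contain the entropy key so Flink can
    # substitute random characters there for checkpoint data files
    state.checkpoints.dir: s3://your-bucket/checkpoints/_entropy_/

If I remember the docs correctly, the random characters are injected only for checkpoint data files; for the checkpoint metadata the _entropy_ segment is simply removed, so the metadata stays at a stable path.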
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/filesystems/s3.html#entropy-injection-for-s3-file-systems

Best,
Congxian

Trystan <entro...@gmail.com> wrote on Thursday, May 7, 2020 at 2:46 AM:

> Hello!
>
> Recently we ran into an issue when checkpointing to S3. Because S3 rate-limits
> based on prefix, the /shared directory would get slammed and cause S3
> throttling. There is no way around this, because /job/checkpoint/:id/shared is
> all part of one prefix, and each prefix is limited to 3,500
> PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second.
>
> (source:
> https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html
> )
>
> Jobs sometimes also crash completely, and they leave state lying around when
> we bring the job up fresh.
>
> Our workarounds have been to 1) reduce the number of taskmanagers, 2) set
> state.backend.rocksdb.checkpoint.transfer.thread.num back to 1 (we had
> increased it to speed up checkpoints/savepoints), and 3) manually delete
> tons of old files in the shared directory.
>
> My question: can we safely apply a Lifecycle Policy to the directory/bucket
> to remove things? How long are files under /shared retained? Is a file kept
> only for the duration of the oldest checkpoint, or could it carry forward,
> untouched, from the very first checkpoint to the very last? This shared
> checkpoint dir/prefix is currently limiting the scalability of our jobs. I
> don't believe the _entropy_ trick would help here, because the issue is
> ultimately that there's a single shared directory.
>
> Thank you!
> Trystan