rkhachatryan commented on code in PR #25235:
URL: https://github.com/apache/flink/pull/25235#discussion_r1731164817
##########
docs/content/docs/deployment/filesystems/s3.md
##########
```diff
@@ -164,4 +164,38 @@ The `s3.entropy.key` defines the string in paths that is replaced by the random
 If a file system operation does not pass the *"inject entropy"* write option, the entropy key substring is simply removed.
 The `s3.entropy.length` defines the number of random alphanumeric characters used for entropy.
+## s5cmd
+
+Both `flink-s3-fs-hadoop` and `flink-s3-fs-presto` can be configured to use the [s5cmd tool](https://github.com/peak/s5cmd) for faster file upload and download.
+[Benchmark results](https://cwiki.apache.org/confluence/display/FLINK/FLIP-444%3A+Native+file+copy+support) are showing that `s5cmd` can be over 2 times more CPU efficient.
+Which means either using half the CPU to upload or download the same set of files, or doing that twice as fast with the same amount of available CPU.
+
+In order to use this feature, the `s5cmd` binary has to be present and accessible to the Flink's task managers, for example via embedding it in the used docker image.
+Secondly the path to the `s5cmd` has to be configured via:
+```yaml
+s3.s5cmd.path: /path/to/the/s5cmd
+```
+
+The remaining configuration options (with their default value listed below) are:
```

Review Comment:
   Maybe also refer to s3 access configuration?

##########
docs/content/docs/deployment/filesystems/s3.md
##########
```diff
@@ -164,4 +164,38 @@ The `s3.entropy.key` defines the string in paths that is replaced by the random
 If a file system operation does not pass the *"inject entropy"* write option, the entropy key substring is simply removed.
 The `s3.entropy.length` defines the number of random alphanumeric characters used for entropy.
+## s5cmd
+
+Both `flink-s3-fs-hadoop` and `flink-s3-fs-presto` can be configured to use the [s5cmd tool](https://github.com/peak/s5cmd) for faster file upload and download.
+[Benchmark results](https://cwiki.apache.org/confluence/display/FLINK/FLIP-444%3A+Native+file+copy+support) are showing that `s5cmd` can be over 2 times more CPU efficient.
+Which means either using half the CPU to upload or download the same set of files, or doing that twice as fast with the same amount of available CPU.
+
+In order to use this feature, the `s5cmd` binary has to be present and accessible to the Flink's task managers, for example via embedding it in the used docker image.
+Secondly the path to the `s5cmd` has to be configured via:
+```yaml
+s3.s5cmd.path: /path/to/the/s5cmd
+```
+
+The remaining configuration options (with their default value listed below) are:
+```yaml
+# Extra arguments that will be passed directly to the s5cmd call. Please refer to the s5cmd's official documentation.
+s3.s5cmd.args: -r 0
+# Maximum size of files that will be uploaded via a single s5cmd call.
+s3.s5cmd.batch.max-size: 1024mb
+# Maximum number of files that will be uploaded via a single s5cmd call.
+s3.s5cmd.batch.max-files: 100
+```
+Both `s3.s5cmd.batch.max-size` and `s3.s5cmd.batch.max-files` are used to control resource usage of the `s5cmd` binary, to prevent it from overloading the task manager.
+
+It is recommended to first configure and making sure Flink works without using `s5cmd` and only then enabling this feature.
+
+### Credentials
+
+If you are using [access keys](#access-keys-discouraged), they will be passed to the `s5cmd`.
+Apart from that `s5cmd` has its own independent (but similar) of Flink way of [using credentials](https://github.com/peak/s5cmd?tab=readme-ov-file#specifying-credentials).
+
+### Limitations
+
+Currently, Flink will use `s5cmd` only during recovery, when downloading state files from S3.
```

Review Comment:
   And only for rocksdb.
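The `s3.s5cmd.batch.max-size` and `s3.s5cmd.batch.max-files` options in the quoted docs cap how much work any single `s5cmd` invocation receives. As a hedged illustration only (this is not Flink's actual implementation, and `plan_batches` is a made-up helper name), such limits could partition a set of files into per-invocation batches like this:

```python
# Illustrative sketch: how per-call limits in the spirit of
# s3.s5cmd.batch.max-files / s3.s5cmd.batch.max-size could split a set of
# files across separate s5cmd invocations. Not Flink's real code.
def plan_batches(files, max_files=100, max_size=1024 * 1024 * 1024):
    """files: iterable of (name, size_in_bytes) pairs.

    Returns a list of batches, each a list of file names, such that no
    batch exceeds max_files entries or max_size total bytes (a single
    oversized file still gets a batch of its own).
    """
    batches, current, current_size = [], [], 0
    for name, size in files:
        # Start a new batch once either limit would be exceeded.
        if current and (len(current) >= max_files or current_size + size > max_size):
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each resulting batch would then correspond to one `s5cmd` call, which is how such limits keep a single invocation from monopolizing task-manager resources.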
##########
docs/content/docs/deployment/filesystems/s3.md
##########
```diff
@@ -164,4 +164,38 @@ The `s3.entropy.key` defines the string in paths that is replaced by the random
 If a file system operation does not pass the *"inject entropy"* write option, the entropy key substring is simply removed.
 The `s3.entropy.length` defines the number of random alphanumeric characters used for entropy.
+## s5cmd
+
+Both `flink-s3-fs-hadoop` and `flink-s3-fs-presto` can be configured to use the [s5cmd tool](https://github.com/peak/s5cmd) for faster file upload and download.
+[Benchmark results](https://cwiki.apache.org/confluence/display/FLINK/FLIP-444%3A+Native+file+copy+support) are showing that `s5cmd` can be over 2 times more CPU efficient.
+Which means either using half the CPU to upload or download the same set of files, or doing that twice as fast with the same amount of available CPU.
+
+In order to use this feature, the `s5cmd` binary has to be present and accessible to the Flink's task managers, for example via embedding it in the used docker image.
+Secondly the path to the `s5cmd` has to be configured via:
```

Review Comment:
   Nit:
   ```suggestion
   Secondly, the path to the `s5cmd` has to be configured via:
   ```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
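For readers following this thread, the settings quoted across the comments above would combine into a Flink configuration fragment along these lines. This is a sketch, not authoritative documentation: the binary path is an assumed install location, and the other values are just the defaults quoted in the diff.

```yaml
# Hypothetical flink-conf.yaml fragment combining the s5cmd options
# discussed in this review; path and values are illustrative.
s3.s5cmd.path: /usr/local/bin/s5cmd   # assumed location of the s5cmd binary
s3.s5cmd.args: -r 0                   # default, passed through to s5cmd
s3.s5cmd.batch.max-size: 1024mb       # default per-call size cap
s3.s5cmd.batch.max-files: 100         # default per-call file-count cap
```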