rkhachatryan commented on code in PR #25235:
URL: https://github.com/apache/flink/pull/25235#discussion_r1731164817


##########
docs/content/docs/deployment/filesystems/s3.md:
##########
@@ -164,4 +164,38 @@ The `s3.entropy.key` defines the string in paths that is replaced by the random
 If a file system operation does not pass the *"inject entropy"* write option, the entropy key substring is simply removed.
 The `s3.entropy.length` defines the number of random alphanumeric characters used for entropy.
 
+## s5cmd
+
+Both `flink-s3-fs-hadoop` and `flink-s3-fs-presto` can be configured to use the [s5cmd tool](https://github.com/peak/s5cmd) for faster file upload and download.
+[Benchmark results](https://cwiki.apache.org/confluence/display/FLINK/FLIP-444%3A+Native+file+copy+support) show that `s5cmd` can be over twice as CPU-efficient.
+This means either using half the CPU to upload or download the same set of files, or doing so twice as fast with the same amount of available CPU.
+
+In order to use this feature, the `s5cmd` binary has to be present and accessible to Flink's task managers, for example by embedding it in the Docker image used.
+Secondly the path to the `s5cmd` has to be configured via:
+```yaml
+s3.s5cmd.path: /path/to/the/s5cmd
+```
+
+The remaining configuration options (with their default values listed below) are:

Review Comment:
   Maybe also refer to s3 access configuration?



##########
docs/content/docs/deployment/filesystems/s3.md:
##########
@@ -164,4 +164,38 @@ The `s3.entropy.key` defines the string in paths that is replaced by the random
 If a file system operation does not pass the *"inject entropy"* write option, the entropy key substring is simply removed.
 The `s3.entropy.length` defines the number of random alphanumeric characters used for entropy.
 
+## s5cmd
+
+Both `flink-s3-fs-hadoop` and `flink-s3-fs-presto` can be configured to use the [s5cmd tool](https://github.com/peak/s5cmd) for faster file upload and download.
+[Benchmark results](https://cwiki.apache.org/confluence/display/FLINK/FLIP-444%3A+Native+file+copy+support) show that `s5cmd` can be over twice as CPU-efficient.
+This means either using half the CPU to upload or download the same set of files, or doing so twice as fast with the same amount of available CPU.
+
+In order to use this feature, the `s5cmd` binary has to be present and accessible to Flink's task managers, for example by embedding it in the Docker image used.
+Secondly the path to the `s5cmd` has to be configured via:
+```yaml
+s3.s5cmd.path: /path/to/the/s5cmd
+```
+
+The remaining configuration options (with their default values listed below) are:
+```yaml
+# Extra arguments that will be passed directly to the s5cmd call. Please refer to s5cmd's official documentation.
+s3.s5cmd.args: -r 0
+# Maximum size of files that will be uploaded via a single s5cmd call.
+s3.s5cmd.batch.max-size: 1024mb
+# Maximum number of files that will be uploaded via a single s5cmd call.
+s3.s5cmd.batch.max-files: 100
+```
+Both `s3.s5cmd.batch.max-size` and `s3.s5cmd.batch.max-files` control the resource usage of the `s5cmd` binary, to prevent it from overloading the task manager.
+
+It is recommended to first configure Flink and make sure it works without `s5cmd`, and only then enable this feature.
+
+### Credentials
+
+If you are using [access keys](#access-keys-discouraged), they will be passed to `s5cmd`.
+Apart from that, `s5cmd` has its own way of [using credentials](https://github.com/peak/s5cmd?tab=readme-ov-file#specifying-credentials), independent of (but similar to) Flink's.
+
+### Limitations
+
+Currently, Flink will use `s5cmd` only during recovery, when downloading state files from S3.

Review Comment:
   And only for rocksdb.
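   Putting the quoted options together, a complete configuration enabling this feature might look like the following sketch (the binary path is illustrative; the keys and default values are taken from the diff above):

   ```yaml
   # Location of the s5cmd binary on the task managers (illustrative path)
   s3.s5cmd.path: /usr/local/bin/s5cmd
   # Extra arguments passed directly to each s5cmd call (default shown)
   s3.s5cmd.args: -r 0
   # Caps on a single s5cmd call, to limit resource usage on the task manager
   s3.s5cmd.batch.max-size: 1024mb
   s3.s5cmd.batch.max-files: 100
   ```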



##########
docs/content/docs/deployment/filesystems/s3.md:
##########
@@ -164,4 +164,38 @@ The `s3.entropy.key` defines the string in paths that is replaced by the random
 If a file system operation does not pass the *"inject entropy"* write option, the entropy key substring is simply removed.
 The `s3.entropy.length` defines the number of random alphanumeric characters used for entropy.
 
+## s5cmd
+
+Both `flink-s3-fs-hadoop` and `flink-s3-fs-presto` can be configured to use the [s5cmd tool](https://github.com/peak/s5cmd) for faster file upload and download.
+[Benchmark results](https://cwiki.apache.org/confluence/display/FLINK/FLIP-444%3A+Native+file+copy+support) show that `s5cmd` can be over twice as CPU-efficient.
+This means either using half the CPU to upload or download the same set of files, or doing so twice as fast with the same amount of available CPU.
+
+In order to use this feature, the `s5cmd` binary has to be present and accessible to Flink's task managers, for example by embedding it in the Docker image used.
+Secondly the path to the `s5cmd` has to be configured via:

Review Comment:
   Nit:
   ```suggestion
   Secondly, the path to the `s5cmd` has to be configured via:
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
