Piotr Nowojski created FLINK-35739:
--------------------------------------

             Summary: FLIP-444: Native file copy support
                 Key: FLINK-35739
                 URL: https://issues.apache.org/jira/browse/FLINK-35739
             Project: Flink
          Issue Type: New Feature
          Components: Connectors / FileSystem
            Reporter: Piotr Nowojski


https://cwiki.apache.org/confluence/display/FLINK/FLIP-444%3A+Native+file+copy+support

State downloading in Flink can be a time and CPU consuming operation, which is 
especially visible if CPU resources per task slot are strictly restricted to 
for example a single CPU. Downloading 1GB of state size can take significant 
amount of time, while the code doing so is quite inefficient.

Currently when downloading state files, Flink is creating an FSDataInputStream 
from the remote file, and copies its bytes, to an OutputStream pointing to a 
local file (in the RocksDBStateDownloader#downloadDataForStateHandle method). 
FSDataInputStream internally is being wrapped by many layers of abstractions 
and indirections and what’s worse, every file is being copied individually, 
which leads to quite high overheads for small files. Download times and 
download process CPU efficiency can be significantly improved if we introduced 
an API to allow org.apache.flink.core.fs.FileSystem to copy many files natively 
and all at once.

For S3, there are at least two potential implementations. The first one is 
using AWS SDKv2 directly (Flink currently is using AWS SDKv1 wrapped by 
hadoop/presto) and Amazon S3 Transfer Manager. Second option is to use a 3rd 
party tool called s5cmd. It is claimed to be a faster alternative to the 
official AWS clients, which was confirmed by our benchmarks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to