Grzegorz Liter created FLINK-35704:
--------------------------------------

             Summary: ForkJoinPool introduction to NonSplittingRecursiveEnumerator to vastly improve enumeration performance
                 Key: FLINK-35704
                 URL: https://issues.apache.org/jira/browse/FLINK-35704
             Project: Flink
          Issue Type: Improvement
          Components: Connectors / FileSystem
            Reporter: Grzegorz Liter
         Attachments: ParallelNonSplittingRecursiveEnumerator.java

In the current implementation of NonSplittingRecursiveEnumerator, files and 
directories are enumerated sequentially. When accessing remote storage such as 
S3, most of the time is spent waiting for responses.

What is worse, the enumeration is done by the JM itself, which is unresponsive 
to RPC calls while it runs. When accessing many (thousands or more) files, the 
wait time adds up quickly and can cause a Pekko timeout.

Performance can be improved by enumerating files in parallel, e.g. with a 
ForkJoinPool and parallel streams. I am attaching an example implementation 
that I am happy to contribute to the Flink repository.
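
To illustrate the idea of parallel enumeration, here is a minimal sketch. It is 
not the attached ParallelNonSplittingRecursiveEnumerator.java: it uses plain 
java.nio.file instead of Flink's FileSystem abstraction, and the names 
ParallelEnumerationSketch, ListDirTask and enumerate are hypothetical. Each 
directory listing becomes a fork/join task, so listings of sibling directories 
can wait on the storage backend concurrently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch, not Flink code: recursive listing on a ForkJoinPool.
class ParallelEnumerationSketch {

    // One task per directory; subdirectories are forked as child tasks.
    static final class ListDirTask extends RecursiveTask<List<Path>> {
        private final Path dir;

        ListDirTask(Path dir) {
            this.dir = dir;
        }

        @Override
        protected List<Path> compute() {
            List<Path> files = new ArrayList<>();
            List<ListDirTask> subTasks = new ArrayList<>();
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                for (Path entry : entries) {
                    if (Files.isDirectory(entry)) {
                        ListDirTask task = new ListDirTask(entry);
                        task.fork(); // list the subdirectory asynchronously
                        subTasks.add(task);
                    } else {
                        files.add(entry);
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            // Collect results of the forked subdirectory listings.
            for (ListDirTask task : subTasks) {
                files.addAll(task.join());
            }
            return files;
        }
    }

    // Enumerates all non-directory entries under root using a dedicated pool.
    static List<Path> enumerate(Path root, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.invoke(new ListDirTask(root));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(enumerate(root, 32).size() + " files found");
    }
}
```

Because the work is dominated by waiting on I/O rather than CPU, the sketch 
uses a dedicated pool whose parallelism can be set well above the core count. 
The value 32 in main is an arbitrary assumption, not a tuned recommendation.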

In my tests it cuts the enumeration time by at least 10x.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
