Grzegorz Liter created FLINK-35704:
--------------------------------------

             Summary: ForkJoinPool introduction to NonSplittingRecursiveEnumerator to vastly improve enumeration performance
                 Key: FLINK-35704
                 URL: https://issues.apache.org/jira/browse/FLINK-35704
             Project: Flink
          Issue Type: Improvement
          Components: Connectors / FileSystem
            Reporter: Grzegorz Liter
         Attachments: ParallelNonSplittingRecursiveEnumerator.java

In the current implementation of NonSplittingRecursiveEnumerator, files and directories are enumerated sequentially. When accessing remote storage such as S3, most of the time is spent waiting for responses. Worse, the enumeration is performed by the JobManager (JM) itself, which is unresponsive to RPC calls while the enumeration runs. When accessing many files (thousands or more), the wait time adds up quickly and can cause a Pekko timeout.

Performance can be improved by enumerating files in parallel, e.g. with a ForkJoinPool and parallel streams. I am attaching an example implementation that I am happy to contribute to the Flink repository. In my tests it cuts the enumeration time by at least 10x.
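
To illustrate the idea, here is a minimal sketch of recursive enumeration on a ForkJoinPool. It is not the attached ParallelNonSplittingRecursiveEnumerator.java: it lists a local directory with plain java.nio.file instead of Flink's FileSystem API, and the names ParallelEnumerationSketch, ListTask and enumerate are hypothetical, chosen only for this example. Each directory is listed by its own task, so listings of sibling directories can overlap instead of waiting on each other.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
import java.util.stream.Stream;

// Hypothetical sketch: local-filesystem stand-in for a remote store such as S3.
public class ParallelEnumerationSketch {

    /** Lists one directory and forks a subtask for every subdirectory found. */
    static final class ListTask extends RecursiveTask<List<Path>> {
        private final Path dir;

        ListTask(Path dir) {
            this.dir = dir;
        }

        @Override
        protected List<Path> compute() {
            List<Path> files = new ArrayList<>();
            List<ListTask> subtasks = new ArrayList<>();
            // Files.list is the (potentially slow) listing call; close the stream when done.
            try (Stream<Path> children = Files.list(dir)) {
                children.forEach(child -> {
                    if (Files.isDirectory(child)) {
                        ListTask task = new ListTask(child);
                        task.fork(); // schedule the subdirectory listing asynchronously
                        subtasks.add(task);
                    } else {
                        files.add(child);
                    }
                });
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            // Collect results only after all subdirectory listings have been started.
            for (ListTask task : subtasks) {
                files.addAll(task.join());
            }
            return files;
        }
    }

    /** Enumerates all regular files under root using a dedicated pool. */
    static List<Path> enumerate(Path root, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.invoke(new ListTask(root));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("enum-sketch");
        Files.createDirectories(root.resolve("a").resolve("b"));
        Files.createFile(root.resolve("top.txt"));
        Files.createFile(root.resolve("a").resolve("one.txt"));
        Files.createFile(root.resolve("a").resolve("b").resolve("two.txt"));
        System.out.println(enumerate(root, 4).size()); // prints 3
    }
}
```

A dedicated pool is used here rather than the common pool because the tasks block on I/O; for a remote store the parallelism would probably need to be well above the number of cores, which is an assumption to validate by measurement.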