Peter Rozsa has uploaded a new patch set (#4). ( http://gerrit.cloudera.org:8080/20700 )
Change subject: IMPALA-12299: Parallelize file listings of Iceberg tables on HDFS/Ozone
......................................................................

IMPALA-12299: Parallelize file listings of Iceberg tables on HDFS/Ozone

This change replaces the single-threaded file metadata listing of
Iceberg datafiles with a thread-pool-based, multithreaded solution.
The thread pool size is calculated based on the filesystem's type and
is capped by MAX_HDFS_PARTITIONS_PARALLEL_LOAD and
MAX_NON_HDFS_PARTITIONS_PARALLEL_LOAD. The parallel tasks are created
from the parent directories of the datafiles, which guarantees that
every datafile is listed.

Benchmarks were executed manually with the following properties:
 - 280.000 partitions, 1 file each (worst case)
 - Thread pool size is 5 (default value for HDFS)
 - A minicluster setup was used as the test bench

The results showed a 3-4x improvement for getFileStatuses():
 - Self-time of getFileStatuses(): 6.599 ms vs 25.399 ms
 - Query time: 16.37 s vs 34.00 s

Tests: ran the exhaustive test suite

Change-Id: Ic5ca7e873f4ad0cc8dab6a77b62e05d965b4a76d
---
M fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java
M fe/src/main/java/org/apache/impala/catalog/ParallelFileMetadataLoader.java
2 files changed, 59 insertions(+), 19 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/20700/4
--
To view, visit http://gerrit.cloudera.org:8080/20700
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic5ca7e873f4ad0cc8dab6a77b62e05d965b4a76d
Gerrit-Change-Number: 20700
Gerrit-PatchSet: 4
Gerrit-Owner: Peter Rozsa <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Peter Rozsa <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
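
The approach described in the commit message (one listing task per parent directory of the datafiles, run on a thread pool with a capped size) could look roughly like the sketch below. This is not the code from the patch. It uses java.nio.file on the local filesystem as a stand-in for the Hadoop FileSystem API, and the names ParallelListingSketch, listDataFiles, listDir, and maxParallelism are hypothetical and exist only for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelListingSketch {

  // Lists one directory and returns the size of every regular file in it.
  static Map<Path, Long> listDir(Path dir) throws IOException {
    Map<Path, Long> sizes = new HashMap<>();
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
      for (Path entry : entries) {
        if (Files.isRegularFile(entry)) sizes.put(entry, Files.size(entry));
      }
    }
    return sizes;
  }

  // Lists the given data files by listing each distinct parent directory
  // once, in parallel. maxParallelism is a hypothetical stand-in for the
  // per-filesystem cap mentioned in the commit message.
  static Map<Path, Long> listDataFiles(List<Path> dataFiles, int maxParallelism)
      throws IOException, InterruptedException {
    Set<Path> wanted = new HashSet<>(dataFiles);
    // Assumption: every data file path has a parent directory.
    Set<Path> parentDirs = new LinkedHashSet<>();
    for (Path file : dataFiles) parentDirs.add(file.getParent());

    Map<Path, Long> result = new HashMap<>();
    if (parentDirs.isEmpty()) return result;

    // Never start more threads than there are directories to list.
    int poolSize = Math.max(1, Math.min(maxParallelism, parentDirs.size()));
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    try {
      List<Callable<Map<Path, Long>>> tasks = new ArrayList<>();
      for (Path dir : parentDirs) tasks.add(() -> listDir(dir));
      // invokeAll blocks until every listing task has completed.
      for (Future<Map<Path, Long>> done : pool.invokeAll(tasks)) {
        for (Map.Entry<Path, Long> e : done.get().entrySet()) {
          // Keep only files the table metadata actually references.
          if (wanted.contains(e.getKey())) result.put(e.getKey(), e.getValue());
        }
      }
    } catch (ExecutionException e) {
      throw new IOException("Directory listing failed", e.getCause());
    } finally {
      pool.shutdown();
    }
    return result;
  }

  public static void main(String[] args) throws Exception {
    // Worst-case layout from the benchmark, scaled down: one file per partition.
    Path root = Files.createTempDirectory("iceberg-listing-sketch");
    List<Path> dataFiles = new ArrayList<>();
    for (int i = 0; i < 3; i++) {
      Path dir = Files.createDirectories(root.resolve("part=" + i));
      Path file = dir.resolve("data.parquet");
      Files.write(file, new byte[16]);
      dataFiles.add(file);
    }
    Map<Path, Long> sizes = listDataFiles(dataFiles, 5);
    System.out.println("Listed " + sizes.size() + " data files");
  }
}
```

Because every datafile has exactly one parent directory, listing each distinct parent once covers all datafiles. In the worst case from the benchmark (one file per partition), the number of tasks equals the number of files, so the pool cap is what bounds concurrency.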
