Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/23363 )
Change subject: IMPALA-14349: Encode FileDescriptors in time in loading Iceberg Tables ...................................................................... Patch Set 2: (7 comments) Thanks for the comments! http://gerrit.cloudera.org:8080/#/c/23363/1/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java File fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java: http://gerrit.cloudera.org:8080/#/c/23363/1/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java@20 PS1, Line 20: import static org.apache.impala.catalog.ParallelFileMetadataLoader. > line too long (101 > 90) Done http://gerrit.cloudera.org:8080/#/c/23363/1/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java@111 PS1, Line 111: // and use different handling methods accordingly. > It seems we are not using the FileSystems. Maybe we can simplify the type t Makes sense, done. http://gerrit.cloudera.org:8080/#/c/23363/1/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java@138 PS1, Line 138: registerNewlyLoadedFd(fd); : } : loadStats_.unknownDiskIds + > This is now done in three places (here, L129, L163). It could make sense to Done http://gerrit.cloudera.org:8080/#/c/23363/1/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java@224 PS1, Line 224: private List<IcebergFileDescriptor> parallelListing( > Will this step take some time on large tables? Can we add a log showing how collectPartitionPaths() takes some time indeed: Collecting 504525 files into 18000 partitions: 1.5s Collecting 754726 files into 107818 partitions: 3.2s If rewrite it to the following: Map<Path, List<ContentFile<?>>> ret = new HashMap<>(); for (ContentFile<?> cf : contentFiles) { CharSequence charSeq = cf.path(); int lastSlash = charSeq.length() - 1; for (; lastSlash >= 0; --lastSlash) { if (charSeq.charAt(lastSlash) == 47) break; } charSeq = charSeq.subSequence(0, lastSlash + 1); Path p = new Path(String.valueOf(charSeq)); List<ContentFile<?>> cfList = ret.get(p); if (cfList != null) { cfList.add(cf); } else { cfList = new ArrayList<>(); cfList.add(cf); ret.put(p, cfList); } } It takes much less time: Collecting 504525 files into 18000 partitions: 0.8s Collecting 754726 files into 107818 partitions: 1.3s But the current code is more readable and safer, also in a real environment the RPCs will likely dominate the execution time. Added the log statement to collectPartitionPaths(). http://gerrit.cloudera.org:8080/#/c/23363/1/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java@268 PS1, Line 268: > Will non-HDFS storages use this method as well? If no, could you please add We use this code path for all FSs that support storage ids. Added a comment. I can also remove this method, but the code might be a bit more readable with it. http://gerrit.cloudera.org:8080/#/c/23363/1/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java@296 PS1, Line 296: Reference<Long> localNumUnknownDiskIds = new Reference<>(0L); > Can we declare numUnknownDiskIds as AtomicLong to get rid of synchronized? I can do it, I only hesitate because it will require changing quite a few method signatures elsewhere, just because of IcebergFileMetadataLoader. Performance-wise I don't think this short synchronized block is problematic, as the bulk of the work is file listing and file descriptor creation. http://gerrit.cloudera.org:8080/#/c/23363/1/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java File fe/src/main/java/org/apache/impala/catalog/IcebergTable.java: http://gerrit.cloudera.org:8080/#/c/23363/1/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java@517 PS1, Line 517: content file > Maybe "content file list" or "content file paths"? First I thought it actua Done -- To view, visit http://gerrit.cloudera.org:8080/23363 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia1c2a7119d76db7ce7c43caec2ccb122a014851b Gerrit-Change-Number: 23363 Gerrit-PatchSet: 2 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Fri, 05 Sep 2025 14:25:17 +0000 Gerrit-HasComments: Yes
