Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23363 )

Change subject: IMPALA-14349: Encode FileDescriptors in time in loading Iceberg 
Tables
......................................................................


Patch Set 2:

(7 comments)

Thanks for the comments!

http://gerrit.cloudera.org:8080/#/c/23363/1/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java
File fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java:

http://gerrit.cloudera.org:8080/#/c/23363/1/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java@20
PS1, Line 20: import static 
org.apache.impala.catalog.ParallelFileMetadataLoader.
> line too long (101 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/23363/1/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java@111
PS1, Line 111:     // and use different handling methods accordingly.
> It seems we are not using the FileSystems. Maybe we can simplify the type t
Makes sense, done.


http://gerrit.cloudera.org:8080/#/c/23363/1/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java@138
PS1, Line 138: registerNewlyLoadedFd(fd);
             :     }
             :     loadStats_.unknownDiskIds +
> This is now done in three places (here, L129, L163). It could make sense to
Done


http://gerrit.cloudera.org:8080/#/c/23363/1/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java@224
PS1, Line 224:   private List<IcebergFileDescriptor> parallelListing(
> Will this step take some time on large tables? Can we add a log showing how
collectPartitionPaths() takes some time indeed:

Collecting 504525 files into 18000 partitions: 1.5s
Collecting 754726 files into 107818 partitions: 3.2s

If rewrite it to the following:

    Map<Path, List<ContentFile<?>>> ret = new HashMap<>();
    for (ContentFile<?> cf : contentFiles) {
      CharSequence charSeq = cf.path();
      int lastSlash = charSeq.length() - 1;
      for (; lastSlash >= 0; --lastSlash) {
        if (charSeq.charAt(lastSlash) == 47) break;
      }
      charSeq = charSeq.subSequence(0, lastSlash + 1);
      Path p = new Path(String.valueOf(charSeq));
      List<ContentFile<?>> cfList = ret.get(p);
      if (cfList != null) {
        cfList.add(cf);
      } else {
        cfList = new ArrayList<>();
        cfList.add(cf);
        ret.put(p, cfList);
      }
    }

It takes much less time:

Collecting 504525 files into 18000 partitions: 0.8s
Collecting 754726 files into 107818 partitions: 1.3s

But the current code is more readable and safer, also in a real environment the 
RPCs will likely dominate the execution time.

Added the log statement to collectPartitionPaths().


http://gerrit.cloudera.org:8080/#/c/23363/1/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java@268
PS1, Line 268:
> Will non-HDFS storages use this method as well? If no, could you please add
We use this code path for all FSs that support storage ids.
Added a comment. I can also remove this method, but the code might be a bit 
more readable with it.


http://gerrit.cloudera.org:8080/#/c/23363/1/fe/src/main/java/org/apache/impala/catalog/IcebergFileMetadataLoader.java@296
PS1, Line 296:     Reference<Long> localNumUnknownDiskIds = new Reference<>(0L);
> Can we declare numUnknownDiskIds as AtomicLong to get rid of synchronized?
I can do it, I only hesitate because it will require changing quite a few 
method signatures elsewhere, just because of IcebergFileMetadataLoader.

Performance-wise I don't think this short synchronized block is problematic, as 
the bulk of the work is file listing and file descriptor creation.


http://gerrit.cloudera.org:8080/#/c/23363/1/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java
File fe/src/main/java/org/apache/impala/catalog/IcebergTable.java:

http://gerrit.cloudera.org:8080/#/c/23363/1/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java@517
PS1, Line 517: content file
> Maybe "content file list" or "content file paths"? First I thought it actua
Done



--
To view, visit http://gerrit.cloudera.org:8080/23363
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia1c2a7119d76db7ce7c43caec2ccb122a014851b
Gerrit-Change-Number: 23363
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Fri, 05 Sep 2025 14:25:17 +0000
Gerrit-HasComments: Yes

Reply via email to