Noemi Pap-Takacs has posted comments on this change.
( http://gerrit.cloudera.org:8080/22014 )

Change subject: IMPALA-13154: Update stats when loading an HDFS table
......................................................................


Patch Set 12:

(2 comments)

Thanks for working on this!

http://gerrit.cloudera.org:8080/#/c/22014/1/fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java
File fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java:

http://gerrit.cloudera.org:8080/#/c/22014/1/fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java@373
PS1, Line 373:     if (part.getWriteId() >= 0)
             :       thriftHdfsPart.setWrite_id(part.getWriteId());
             :     if (type == ThriftObjectType.FULL) {
             :       thriftHdfsPart.setPartition_name(part.getPartitionName());
             :       thriftHdfsPart.setStats(new TTableStats(part.getNumRows()));
             :
> Note that when the table is a Hive ACID table or Iceberg V2 table,
 > we use insertFileDescriptors and deleteFileDescriptors , and keep
 > fileDescriptors as empty. For other kinds of HDFS table, we use
 > fileDescriptors and keep insertFileDescriptors and
 > deleteFileDescriptors as empty.

This is not true for Iceberg tables. In V2 tables we count both data and delete
files simply as fileDescriptors, and do not put them into insertFileDescriptors
or deleteFileDescriptors. We keep track of the file type in
GroupedContentFiles.


http://gerrit.cloudera.org:8080/#/c/22014/12/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

http://gerrit.cloudera.org:8080/#/c/22014/12/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@827
PS12, Line 827:     // TODO(todd): would be good to log a summary of the loading process:
Just an idea: what about collecting the table/partition loading stats from the
loaders' LoadStats objects (available in ParallelFileMetadataLoader) and
summarizing them here into FileMetadataStats? Currently these two classes are
not connected, so we iterate through the file descriptors twice (once in the
FileMetadataLoaders during loading and once in HdfsTable) just to get simple
stats like the number of files. We could also log the summary. See
IMPALA-13122.
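
A rough sketch of the idea, only to illustrate the shape of it. The classes
below are hypothetical, simplified stand-ins: LoadStatsSketch and
FileMetadataStatsSketch are not the real LoadStats and FileMetadataStats, and
their field names (loadedFiles, loadedBytes, numFiles, totalFileBytes) are
assumptions made for this example. The point is that the per-loader counters
are summed once, without walking the file descriptors a second time.

```java
import java.util.List;

// Hypothetical stand-in for the per-loader stats object. The field names
// are assumptions for this sketch, not the actual Impala fields.
class LoadStatsSketch {
  final int loadedFiles;
  final long loadedBytes;

  LoadStatsSketch(int loadedFiles, long loadedBytes) {
    this.loadedFiles = loadedFiles;
    this.loadedBytes = loadedBytes;
  }
}

// Hypothetical stand-in for the table-level stats holder.
class FileMetadataStatsSketch {
  long numFiles;
  long totalFileBytes;

  // Folds one loader's counters into the running totals.
  void add(LoadStatsSketch stats) {
    numFiles += stats.loadedFiles;
    totalFileBytes += stats.loadedBytes;
  }

  @Override
  public String toString() {
    return "numFiles=" + numFiles + ", totalFileBytes=" + totalFileBytes;
  }
}

class LoadSummarySketch {
  // Sums the stats already gathered by each loader, so the file
  // descriptors do not need to be iterated again.
  static FileMetadataStatsSketch summarize(List<LoadStatsSketch> perLoader) {
    FileMetadataStatsSketch total = new FileMetadataStatsSketch();
    for (LoadStatsSketch stats : perLoader) {
      total.add(stats);
    }
    return total;
  }
}
```

The returned object could then be passed to the logger at the TODO above, so
the summary and the table-level stats come from the same single pass.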



--
To view, visit http://gerrit.cloudera.org:8080/22014
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6e2eb503b0f61b1e6403058bc5dc78d721e7e940
Gerrit-Change-Number: 22014
Gerrit-PatchSet: 12
Gerrit-Owner: Xuebin Su <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Xuebin Su <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Fri, 15 Nov 2024 14:37:45 +0000
Gerrit-HasComments: Yes
