Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22014 )

Change subject: IMPALA-13154: Update stats when load an HDFS table
......................................................................


Patch Set 1:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/22014/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/22014/1//COMMIT_MSG@7
PS1, Line 7: load
nit: "loading"


http://gerrit.cloudera.org:8080/#/c/22014/1/fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java
File fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java:

http://gerrit.cloudera.org:8080/#/c/22014/1/fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java@369
PS1, Line 369: getFileDescriptors
This would be an expensive operation for large tables, e.g. with 5M files. It 
transforms the encodedFileDescriptors back to FileDescriptors:
https://github.com/apache/impala/blob/5c9b82854a3397572c4adb3c6cebb2e7d1aaad9a/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L1561

Before this patch, we do this in HdfsTable#getTHdfsTable() which is OK and 
expected to be slow for large tables. Now we do this in HdfsTable#load() where 
we just loaded the table and converted FileDescriptors to 
encodedFileDescriptors. It's a waste to convert them back just for counting 
these stats. We can add the stats in HdfsPartition$Builder when setting the 
FileDescriptors:
https://github.com/apache/impala/blob/5c9b82854a3397572c4adb3c6cebb2e7d1aaad9a/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L1585
https://github.com/apache/impala/blob/5c9b82854a3397572c4adb3c6cebb2e7d1aaad9a/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L1600
https://github.com/apache/impala/blob/5c9b82854a3397572c4adb3c6cebb2e7d1aaad9a/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L1608

HdfsPartition$Builder can pass them to HdfsPartition in the constructors.


http://gerrit.cloudera.org:8080/#/c/22014/1/fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java@373
PS1, Line 373:     if (!partition.getInsertFileDescriptors().isEmpty()) {
             :       stats.numFiles = 
partition.getInsertFileDescriptors().size() +
             :           partition.getDeleteFileDescriptors().size();
             :     } else {
             :       stats.numFiles = partition.getFileDescriptors().size();
             :     }
I think we can simply set 'numFiles' to HdfsPartition#getNumFileDescriptors().

Note that when the table is a Hive ACID table or Iceberg V2 table, we use 
insertFileDescriptors and deleteFileDescriptors , and keep fileDescriptors as 
empty. For other kinds of HDFS table, we use fileDescriptors and keep 
insertFileDescriptors and deleteFileDescriptors as empty.
https://github.com/apache/impala/blob/5c9b82854a3397572c4adb3c6cebb2e7d1aaad9a/fe/src/main/java/org/apache/impala/catalog/ParallelFileMetadataLoader.java#L157-L162

So we simply count the number of files as a sum of them:
https://github.com/apache/impala/blob/5c9b82854a3397572c4adb3c6cebb2e7d1aaad9a/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L1057-L1059

As the above comment mentioned, we'd better calculate these states when loading 
(setting) the FileDescriptors.


http://gerrit.cloudera.org:8080/#/c/22014/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

http://gerrit.cloudera.org:8080/#/c/22014/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1240
PS1, Line 1240:     memUsageEstimate += fileMetadataStats_.numFiles * 
PER_FD_MEM_USAGE_BYTES +
              :         fileMetadataStats_.numBlocks * 
PER_BLOCK_MEM_USAGE_BYTES;
'fileMetadataStats_' is still the old stats before we update it in the next 
line. I think we should use 'newStats' here.


http://gerrit.cloudera.org:8080/#/c/22014/1/tests/webserver/test_web_pages.py
File tests/webserver/test_web_pages.py:

http://gerrit.cloudera.org:8080/#/c/22014/1/tests/webserver/test_web_pages.py@535
PS1, Line 535:     assert table_in_list(response_json, "large_tables")
             :     assert table_in_list(response_json, "high_file_count_tables")
These might fail since we only show the top-25 tables:
https://github.com/apache/impala/blob/5c9b82854a3397572c4adb3c6cebb2e7d1aaad9a/fe/src/main/java/org/apache/impala/catalog/monitor/CatalogTableMetrics.java#L56-L57

This is an e2e test that runs concurrently with other tests. There could be 
other tables loaded and the current table won't show up in the list. To make 
this test stable, we'd better move it into a custom-cluster test, e.g. in 
tests/custom_cluster/test_web_pages.py

BTW, we can use an existing table, e.g. functional.alltypes, and check some 
stats like numFiles.



--
To view, visit http://gerrit.cloudera.org:8080/22014
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6e2eb503b0f61b1e6403058bc5dc78d721e7e940
Gerrit-Change-Number: 22014
Gerrit-PatchSet: 1
Gerrit-Owner: Xuebin Su <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Mon, 04 Nov 2024 00:37:01 +0000
Gerrit-HasComments: Yes

Reply via email to