Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/22014 )
Change subject: IMPALA-13154: Update stats when load an HDFS table ...................................................................... Patch Set 1: (5 comments) http://gerrit.cloudera.org:8080/#/c/22014/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/22014/1//COMMIT_MSG@7 PS1, Line 7: load nit: "loading" http://gerrit.cloudera.org:8080/#/c/22014/1/fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java File fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java: http://gerrit.cloudera.org:8080/#/c/22014/1/fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java@369 PS1, Line 369: getFileDescriptors This would be an expensive operation for large tables, e.g. with 5M files. It transforms the encodedFileDescriptors back to FileDescriptors: https://github.com/apache/impala/blob/5c9b82854a3397572c4adb3c6cebb2e7d1aaad9a/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L1561 Before this patch, we do this in HdfsTable#getTHdfsTable() which is OK and expected to be slow for large tables. Now we do this in HdfsTable#load() where we just loaded the table and converted FileDescriptors to encodedFileDescriptors. It's a waste to convert them back just for counting these stats. We can add the stats in HdfsPartition$Builder when setting the FileDescriptors: https://github.com/apache/impala/blob/5c9b82854a3397572c4adb3c6cebb2e7d1aaad9a/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L1585 https://github.com/apache/impala/blob/5c9b82854a3397572c4adb3c6cebb2e7d1aaad9a/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L1600 https://github.com/apache/impala/blob/5c9b82854a3397572c4adb3c6cebb2e7d1aaad9a/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L1608 HdfsPartition$Builder can pass them to HdfsPartition in the constructors. http://gerrit.cloudera.org:8080/#/c/22014/1/fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java@373 PS1, Line 373: if (!partition.getInsertFileDescriptors().isEmpty()) { : stats.numFiles = partition.getInsertFileDescriptors().size() + : partition.getDeleteFileDescriptors().size(); : } else { : stats.numFiles = partition.getFileDescriptors().size(); : } I think we can simply set 'numFiles' to HdfsPartition#getNumFileDescriptors(). Note that when the table is a Hive ACID table or Iceberg V2 table, we use insertFileDescriptors and deleteFileDescriptors , and keep fileDescriptors as empty. For other kinds of HDFS table, we use fileDescriptors and keep insertFileDescriptors and deleteFileDescriptors as empty. https://github.com/apache/impala/blob/5c9b82854a3397572c4adb3c6cebb2e7d1aaad9a/fe/src/main/java/org/apache/impala/catalog/ParallelFileMetadataLoader.java#L157-L162 So we simply count the number of files as a sum of them: https://github.com/apache/impala/blob/5c9b82854a3397572c4adb3c6cebb2e7d1aaad9a/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L1057-L1059 As the above comment mentioned, we'd better calculate these states when loading (setting) the FileDescriptors. http://gerrit.cloudera.org:8080/#/c/22014/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: http://gerrit.cloudera.org:8080/#/c/22014/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1240 PS1, Line 1240: memUsageEstimate += fileMetadataStats_.numFiles * PER_FD_MEM_USAGE_BYTES + : fileMetadataStats_.numBlocks * PER_BLOCK_MEM_USAGE_BYTES; 'fileMetadataStats_' is still the old stats before we update it in the next line. I think we should use 'newStats' here. http://gerrit.cloudera.org:8080/#/c/22014/1/tests/webserver/test_web_pages.py File tests/webserver/test_web_pages.py: http://gerrit.cloudera.org:8080/#/c/22014/1/tests/webserver/test_web_pages.py@535 PS1, Line 535: assert table_in_list(response_json, "large_tables") : assert table_in_list(response_json, "high_file_count_tables") These might fail since we only show the top-25 tables: https://github.com/apache/impala/blob/5c9b82854a3397572c4adb3c6cebb2e7d1aaad9a/fe/src/main/java/org/apache/impala/catalog/monitor/CatalogTableMetrics.java#L56-L57 This is an e2e test that runs concurrently with other tests. There could be other tables loaded and the current table won't show up in the list. To make this test stable, we'd better move it into a custom-cluster test, e.g. in tests/custom_cluster/test_web_pages.py BTW, we can use an existing table, e.g. functional.alltypes, and check some stats like numFiles. -- To view, visit http://gerrit.cloudera.org:8080/22014 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6e2eb503b0f61b1e6403058bc5dc78d721e7e940 Gerrit-Change-Number: 22014 Gerrit-PatchSet: 1 Gerrit-Owner: Xuebin Su <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Comment-Date: Mon, 04 Nov 2024 00:37:01 +0000 Gerrit-HasComments: Yes
