Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/22014 )
Change subject: IMPALA-13154: Update stats when loading an HDFS table ...................................................................... Patch Set 4: (5 comments) http://gerrit.cloudera.org:8080/#/c/22014/4/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java: http://gerrit.cloudera.org:8080/#/c/22014/4/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@1626 PS4, Line 1626: for (FileDescriptor fd: descriptors) { : numBlocks_ += fd.getNumFileBlocks(); : totalFileBytes_ += fd.getFileLength(); : } To avoid duplicate codes, I think we can extract these into a method, e.g. FileMetadataStats#update(List<FileDescriptor>), and use one field for FileMetadataStats instead of adding three fields for numBlocks_, numFIles_, and totalFileBytes_. http://gerrit.cloudera.org:8080/#/c/22014/4/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: http://gerrit.cloudera.org:8080/#/c/22014/4/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@2484 PS4, Line 2484: * To prevent the catalog from hitting an OOM error while trying to : * serialize large partition incremental stats, we estimate the stats size and filter : * the incremental stats data from partition objects if the estimate exceeds : * --inc_stats_size_limit_bytes. This is stale after IMPALA-8667. Can we remove it by the way? https://gerrit.cloudera.org/c/13671/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java http://gerrit.cloudera.org:8080/#/c/22014/4/tests/custom_cluster/test_web_pages.py File tests/custom_cluster/test_web_pages.py: http://gerrit.cloudera.org:8080/#/c/22014/4/tests/custom_cluster/test_web_pages.py@38 PS4, Line 38: def get_and_check_status(url): Maybe we can use get_debug_page() directly: https://github.com/apache/impala/blob/5c9b82854a3397572c4adb3c6cebb2e7d1aaad9a/tests/common/impala_test_suite.py#L495-L499 http://gerrit.cloudera.org:8080/#/c/22014/4/tests/custom_cluster/test_web_pages.py@514 PS4, Line 514: def test_catalog_tables_stats_after_describe(self): In case this feature is broken in local catalog mode in the future, we'd better test it in local catalog mode as well. Usually we do this by reusing this method in two tests. E.g. @CustomClusterTestSuite.with_args( catalogd_args="--catalog_topic_mode=full", impalad_args="--use_local_catalog=false") def test_catalog_tables_stats_legacy_catalog(self): self._test_catalog_tables_stats_after_describe() @CustomClusterTestSuite.with_args( catalogd_args="--catalog_topic_mode=minimal", impalad_args="--use_local_catalog=true") def test_catalog_tables_stats_local_catalog(self): self._test_catalog_tables_stats_after_describe() def _test_catalog_tables_stats_after_describe(self): ... Note that util methods should have a name not starts with "test" so pytest won't run them as tests. Here we add a underscore "_" before the util method name. http://gerrit.cloudera.org:8080/#/c/22014/4/tests/custom_cluster/test_web_pages.py@540 PS4, Line 540: assert table_in_list(response_json, "high_file_count_tables") Can we verify the values as well? E.g. the table has 24 files. The estimate memory size is 127152. -- To view, visit http://gerrit.cloudera.org:8080/22014 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6e2eb503b0f61b1e6403058bc5dc78d721e7e940 Gerrit-Change-Number: 22014 Gerrit-PatchSet: 4 Gerrit-Owner: Xuebin Su <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Xuebin Su <[email protected]> Gerrit-Comment-Date: Mon, 04 Nov 2024 12:36:47 +0000 Gerrit-HasComments: Yes
