Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/22559 )
Change subject: IMPALA-11402: Add limit on files fetched by a single getPartialCatalogObject request ...................................................................... Patch Set 18: (3 comments) Thanks for the review! > Also, what happens if the table is modified between two (or more) RPCs? E.g. > we write files into all partitions, and the partitions returned in the first > TGetPartialCatalogObjectResponse RPC do not contain the new files, but the > second batch already contains new files? Is this possible? Is it a new kind > of problem or did we already have it in other places? This seems to be a FAQ now: https://gerrit.cloudera.org/c/22559/14/fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java#1117 Coordinator will retry the planning since the catalog version changes. Updated the commit message to mention this. http://gerrit.cloudera.org:8080/#/c/22559/17/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: http://gerrit.cloudera.org:8080/#/c/22559/17/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@2356 PS17, Line 2356: > Nit: could add "startup flag 'catalog_partial...'". Done http://gerrit.cloudera.org:8080/#/c/22559/17/fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java File fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java: http://gerrit.cloudera.org:8080/#/c/22559/17/fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java@1109 PS17, Line 1109: for (int i = numFetchedParts; i < ids.size(); i++) { > Would it make sense to only request as many partitions in the request as th The limit is on number of files but we are requesting partitions here. So we don't know how many partitions to request to avoid hitting the limit, unless catalogd sends back the number of files for each partition while returning the partition list. Currently, the partition list is quite lightweight that don't have such info. Coordinator loads the partition list by only setting want_partition_names=true: https://github.com/apache/impala/blob/f22b805c889d34568008ee0bbea47d04b4ba09b8/fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java#L942-L953 Catalogd handles it here: https://github.com/apache/impala/blob/f22b805c889d34568008ee0bbea47d04b4ba09b8/fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java#L225-L226 http://gerrit.cloudera.org:8080/#/c/22559/17/tests/custom_cluster/test_local_catalog.py File tests/custom_cluster/test_local_catalog.py: http://gerrit.cloudera.org:8080/#/c/22559/17/tests/custom_cluster/test_local_catalog.py@719 PS17, Line 719: err = ("Too many files to collect in table {0} partition year=2009/month=1: 2. " > Is it logged for each partition here? Would it make sense to all or at leas It will just log for the first partition since catalogd stops collecting any partitions when it finds collecting the first partition already hits the limit. BTW, the default limit is 1M. I think it's rare for a single partition to have such many files. This test is just to complete the coverage. -- To view, visit http://gerrit.cloudera.org:8080/22559 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibb13fec20de5a17e7fc33613ca5cdebb9ac1a1e5 Gerrit-Change-Number: 22559 Gerrit-PatchSet: 18 Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Kurt Deschler <kdesc...@cloudera.com> Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com> Gerrit-Comment-Date: Thu, 17 Apr 2025 02:40:01 +0000 Gerrit-HasComments: Yes