Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22559 )

Change subject: IMPALA-11402: Add limit on files fetched by a single 
getPartialCatalogObject request
......................................................................


Patch Set 18:

(3 comments)

Thanks for the review!

> Also, what happens if the table is modified between two (or more) RPCs? E.g. 
> we write files into all partitions, and the partitions returned in the first 
> TGetPartialCatalogObjectResponse RPC do not contain the new files, but the 
> second batch already contains new files? Is this possible? Is it a new kind 
> of problem or did we already have it in other places?

This seems to be a FAQ now: 
https://gerrit.cloudera.org/c/22559/14/fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java#1117
Coordinator will retry the planning since the catalog version changes. Updated 
the commit message to mention this.

http://gerrit.cloudera.org:8080/#/c/22559/17/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

http://gerrit.cloudera.org:8080/#/c/22559/17/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@2356
PS17, Line 2356:
> Nit: could add "startup flag 'catalog_partial...'".
Done


http://gerrit.cloudera.org:8080/#/c/22559/17/fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java
File fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java:

http://gerrit.cloudera.org:8080/#/c/22559/17/fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java@1109
PS17, Line 1109:       for (int i = numFetchedParts; i < ids.size(); i++) {
> Would it make sense to only request as many partitions in the request as th
The limit is on number of files but we are requesting partitions here. So we 
don't know how many partitions to request to avoid hitting the limit, unless 
catalogd sends back the number of files for each partition while returning the 
partition list. Currently, the partition list is quite lightweight that don't 
have such info.

Coordinator loads the partition list by only setting want_partition_names=true:
https://github.com/apache/impala/blob/f22b805c889d34568008ee0bbea47d04b4ba09b8/fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java#L942-L953
Catalogd handles it here:
https://github.com/apache/impala/blob/f22b805c889d34568008ee0bbea47d04b4ba09b8/fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java#L225-L226


http://gerrit.cloudera.org:8080/#/c/22559/17/tests/custom_cluster/test_local_catalog.py
File tests/custom_cluster/test_local_catalog.py:

http://gerrit.cloudera.org:8080/#/c/22559/17/tests/custom_cluster/test_local_catalog.py@719
PS17, Line 719:     err = ("Too many files to collect in table {0} partition 
year=2009/month=1: 2. "
> Is it logged for each partition here? Would it make sense to all or at leas
It will just log for the first partition since catalogd stops collecting any 
partitions when it finds collecting the first partition already hits the limit. 
BTW, the default limit is 1M. I think it's rare for a single partition to have 
such many files. This test is just to complete the coverage.



--
To view, visit http://gerrit.cloudera.org:8080/22559
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibb13fec20de5a17e7fc33613ca5cdebb9ac1a1e5
Gerrit-Change-Number: 22559
Gerrit-PatchSet: 18
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kdesc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Comment-Date: Thu, 17 Apr 2025 02:40:01 +0000
Gerrit-HasComments: Yes

Reply via email to