[
https://issues.apache.org/jira/browse/IMPALA-9695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Quanlong Huang resolved IMPALA-9695.
------------------------------------
Resolution: Duplicate
This duplicates IMPALA-4105.
> Support incomplete partition spec in REFRESH statement
> ------------------------------------------------------
>
> Key: IMPALA-9695
> URL: https://issues.apache.org/jira/browse/IMPALA-9695
> Project: IMPALA
> Issue Type: New Feature
> Components: Catalog
> Reporter: Quanlong Huang
> Priority: Critical
>
> We support explicitly specifying a partition in the REFRESH statement. When
> users have several partitions to refresh, they have to issue several
> REFRESH statements. Each REFRESH statement requires the table lock, so they
> are executed in catalogd one by one. What's worse, the table is updated
> (catalog version bumped) several times, which may cause catalogd to propagate
> it to the coordinators several times. This is bad for huge tables that
> contain a large number of partitions: their catalog objects are huge, since
> catalogd can't send incremental updates for only the changed partitions.
> A possible scenario is hourly partitioned tables that have more than one
> level of partition keys:
> {code:sql}
> create table hourly_part_tbl (id int, msg string)
> partitioned by (hour_id bigint, event_type bigint)
> {code}
> Let's say there are 20 event_types. Every hour there will be 20 partitions
> generated with a new hour_id. If the retention time for this table is 2
> years, the total number of partitions will be 2 * 365 * 24 * 20 = 350,400.
> The catalog object size for this table will be huge, especially since in
> practice there will be many columns and hence incremental stats.
> Every hour, users have to run 20 REFRESH statements one by one on this table.
> The catalog server will send 20 updates to coordinators for this table. In a
> busy cluster (with many other tables), catalogd may end up constantly busy
> loading metadata for this table.
> One possible solution is using REFRESH without the partition spec.
> Unfortunately, we still load FileStatus for all loaded partitions. It's
> possible that this single statement can't finish in an hour.
> Another solution is to support the REFRESH statement with an incomplete
> partition spec, so users can use one statement:
> {code:sql}
> REFRESH hourly_part_tbl PARTITION(hour_id=xxx);
> {code}
> Then catalogd only needs to acquire the table lock once and send its catalog
> update once.
> It'd also be useful if we supported non-equality predicates in the partition
> spec:
> {code:sql}
> REFRESH hourly_part_tbl PARTITION(hour_id >= xxx);
> {code}
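
The arithmetic and the per-hour workload in the quoted description can be checked with a minimal Python sketch. The function names and the sample hour_id value are hypothetical, chosen only for illustration, and the statements are just formatted strings, not output from Impala. Note that the total depends on the partitions-per-hour figure: 20 per hour (one per event_type) gives 350,400 partitions over two years, while 10 per hour gives 175,200.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365


def total_partitions(retention_years, partitions_per_hour):
    """Number of partitions retained for an hourly partitioned table."""
    return retention_years * DAYS_PER_YEAR * HOURS_PER_DAY * partitions_per_hour


def refresh_statements(table, hour_id, event_types):
    """One fully specified REFRESH statement per (hour_id, event_type) partition.

    This is what users must issue today: each statement names every
    partition key, so each one takes the table lock separately.
    """
    return [
        f"REFRESH {table} PARTITION (hour_id={hour_id}, event_type={event_type});"
        for event_type in event_types
    ]


if __name__ == "__main__":
    print(total_partitions(2, 20))
    print(total_partitions(2, 10))
    # 2020042800 is a made-up hour_id used only as an example value.
    for stmt in refresh_statements("hourly_part_tbl", 2020042800, range(20)):
        print(stmt)
```

With an incomplete partition spec, the 20 generated statements would collapse into the single statement shown in the description.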