[ 
https://issues.apache.org/jira/browse/IMPALA-11986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18060759#comment-18060759
 ] 

ASF subversion and git services commented on IMPALA-11986:
----------------------------------------------------------

Commit bf7c2088dd5495a763ff9a381970f99e6101cd4b in impala's branch 
refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=bf7c2088d ]

IMPALA-11986: (part 1) Optimize partition key scans for Iceberg tables

This patch optimizes queries that scan only IDENTITY-partitioned
columns. The optimization applies only if:
* All materialized aggregate expressions have distinct semantics
  (e.g. MIN, MAX, NDV). In other words, this optimization will work
  for COUNT(DISTINCT c) but not COUNT(c).
* All materialized columns are IDENTITY-partitioned in all partition
  specs (this can be relaxed later)
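
The two conditions above can be read as a simple eligibility check. The
following is a minimal Python sketch for illustration only; it is not
Impala's frontend code. The names Agg, has_distinct_semantics and
can_optimize are hypothetical, and partition specs are modelled as plain
dicts mapping a column name to its transform name.

```python
from dataclasses import dataclass

# Aggregate functions whose result is unchanged when duplicate input
# values are dropped. Illustrative set taken from the commit message.
DISTINCT_SEMANTICS = {"MIN", "MAX", "NDV"}


@dataclass(frozen=True)
class Agg:
    """Hypothetical stand-in for a materialized aggregate expression."""
    func: str               # e.g. "MIN", "COUNT"
    column: str
    distinct: bool = False  # True for COUNT(DISTINCT c)


def has_distinct_semantics(agg):
    # Any aggregate with DISTINCT, or one that ignores duplicates anyway.
    return agg.distinct or agg.func.upper() in DISTINCT_SEMANTICS


def can_optimize(aggs, materialized_columns, partition_specs):
    """Return True if both conditions from the commit message hold.

    partition_specs: list of dicts, one per partition spec of the table,
    each mapping column name -> transform name (assumed representation).
    """
    if not partition_specs:
        return False  # unpartitioned table: nothing to optimize
    if not all(has_distinct_semantics(a) for a in aggs):
        return False
    # Every materialized column must be IDENTITY-partitioned in every spec.
    return all(
        spec.get(col) == "identity"
        for spec in partition_specs
        for col in materialized_columns
    )
```

With this sketch, COUNT(DISTINCT c) on an identity-partitioned column
passes the check, while COUNT(c) or a column partitioned by another
transform in any spec does not.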

If the above conditions are met, then each data file (without deletes)
produces only a single record. The rest of the table (data files with
deletes, and delete files) is scanned normally.
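
To see why a single record per data file is enough, here is a small
Python simulation. It is a sketch under stated assumptions, not the
actual scanner: each data file is modelled as a dict with a partition
value ("part"), a row count ("rows") and a set of deleted row positions
("deleted"); these field names and both functions are hypothetical.

```python
def full_scan(files):
    """Yield the partition value of every live row in every data file."""
    for f in files:
        for pos in range(f["rows"]):
            if pos not in f["deleted"]:
                yield f["part"]


def optimized_scan(files):
    """Yield one record per delete-free data file; scan the rest normally."""
    for f in files:
        if not f["deleted"]:
            # All rows in the file share the same identity-partition value,
            # so one record represents the whole file.
            if f["rows"] > 0:
                yield f["part"]
        else:
            # Files with deletes may have lost all rows, so they are
            # scanned row by row with the deletes applied.
            for pos in range(f["rows"]):
                if pos not in f["deleted"]:
                    yield f["part"]
```

MIN, MAX and the number of distinct values are identical over both
scans, because they do not depend on how often a value repeats. A plain
COUNT would differ, which is why COUNT(c) is excluded from the
optimization.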

Testing:
* added e2e tests

Change-Id: I32f78ee60ac4a410e91cf0e858199dd39d2e9afe
Reviewed-on: http://gerrit.cloudera.org:8080/23985
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Optimize MIN(part_col)/ MAX(part_col)/ COUNT(DISTINCT part_col)/ queries for 
> Iceberg tables
> -------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-11986
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11986
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>            Reporter: Li Penglin
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg, performance
>
> For Iceberg V1 and V2 tables without deletes:
> [https://impala.apache.org/docs/build/html/topics/impala_optimize_partition_key_scans.html]
> OPTIMIZE_PARTITION_KEY_SCANS optimizes MIN(key_column), MAX(key_column),
> and COUNT(DISTINCT key_column) queries using the 'TBLS' table and the
> 'PARTITION_KEY_VALS' partition key column in the HMS metadata. For Iceberg
> tables, partitioning stats are not stored in the HMS, but they can be
> obtained through the Iceberg API. We can optimize MIN(key_column),
> MAX(key_column), and COUNT(DISTINCT key_column) queries with a similar
> approach, but we must make sure that the partition transform is 'identity'.
> For non-partitioned columns, if min-max information is stored in the Iceberg
> metadata, could MIN(column) and MAX(column) queries also be optimized in the
> same way?
> However, Impala does not guarantee that the statistics for these
> non-partitioned columns are complete, which complicates things.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
