[ 
https://issues.apache.org/jira/browse/IMPALA-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltán Borók-Nagy updated IMPALA-12765:
---------------------------------------
    Description: 
During scheduling Impala does the following:
 * Non-Iceberg tables
 ** The scheduler processes the scan ranges in *partition key order*
 ** The scheduler selects N replicas as candidates
 ** The scheduler chooses the executor from the candidates based on minimum 
number of assigned bytes
 ** So consecutive partitions are more likely to be assigned to different 
executors
 * Iceberg tables
 ** The scheduler processes the scan ranges in *random order*
 ** The scheduler selects N replicas as candidates
 ** The scheduler chooses the executor from the candidates based on minimum 
number of assigned bytes
 ** So consecutive partitions (by partition key order) are assigned randomly, 
i.e. there's a higher chance of clustering
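
The candidate-based assignment above can be sketched as follows. This is a 
hypothetical simulation, not Impala's scheduler code; the executor count, the 
candidate count N, and the uniform scan-range sizes are made-up assumptions:

```python
import random

NUM_EXECUTORS = 3
NUM_CANDIDATES = 2  # "N replicas" picked as candidates per scan range

def assign(assigned_bytes, nbytes, rng):
    """Pick N candidate executors at random, then choose the one with the
    minimum number of assigned bytes so far."""
    candidates = [rng.randrange(NUM_EXECUTORS) for _ in range(NUM_CANDIDATES)]
    best = min(candidates, key=lambda e: assigned_bytes[e])
    assigned_bytes[best] += nbytes
    return best

def schedule(scan_order, nbytes=100, seed=42):
    """Process one scan range per partition in the given order; return the
    partition -> executor assignment."""
    rng = random.Random(seed)
    assigned_bytes = [0] * NUM_EXECUTORS
    return {p: assign(assigned_bytes, nbytes, rng) for p in scan_order}

partitions = list(range(12))
by_key = schedule(partitions)           # non-Iceberg: partition key order
shuffled = partitions[:]
random.Random(7).shuffle(shuffled)
by_random = schedule(shuffled)          # Iceberg today: random order
print("key order:   ", [by_key[p] for p in partitions])
print("random order:", [by_random[p] for p in partitions])
```

In key order the min-bytes rule tends to push consecutive partitions onto 
different executors; in random order neighbouring partitions (by key) are 
assigned at arbitrary times, so they can end up clustered.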

If the IcebergScanNode ordered its file descriptors by their paths, we would 
get more balanced scheduling for consecutive partitions. Queries that operate 
on a range of partitions are quite common, so it makes sense to optimize for 
that case.
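
A minimal sketch of the proposed ordering, assuming a lexicographically 
sortable partition-directory layout; `FileDescriptor` and its `path` field 
are illustrative names, not Impala's actual API:

```python
from dataclasses import dataclass

@dataclass
class FileDescriptor:
    path: str  # illustrative stand-in for Impala's file descriptor

file_descs = [
    FileDescriptor("inv_date_sk=2451180/data-0.parq"),
    FileDescriptor("inv_date_sk=2450815/data-0.parq"),
    FileDescriptor("inv_date_sk=2450996/data-0.parq"),
]

# Sort the file descriptors by path before creating scan ranges; for this
# layout, path order matches partition key order, so neighbouring
# partitions are processed consecutively.
file_descs.sort(key=lambda fd: fd.path)
print([fd.path for fd in file_descs])
```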

It is especially important for queries that prune partitions via runtime 
filters (e.g. due to a JOIN): even if we schedule the scan ranges evenly, the 
scan ranges that survive the runtime filters can still be clustered on 
certain executors.

E.g. TPC-DS Q22 has the following JOIN and WHERE predicates:

inv_date_sk=d_date_sk and
d_month_seq between 1199 and 1199 + 11

The Inventory table is partitioned by column inv_date_sk, and the rows of the 
joined table are filtered by 'd_month_seq between 1199 and 1199 + 11'. This 
means we only need a range of partitions from the Inventory table, but that 
range is only revealed at runtime. Scheduling neighbouring partitions to 
different executors means that the surviving partitions are spread across 
executors more evenly.
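
The effect can be illustrated with made-up numbers: 12 Inventory partitions, 
3 executors, and a runtime filter that (as in Q22) keeps only a contiguous 
range of partitions:

```python
from collections import Counter

NUM_EXECUTORS = 3
partitions = list(range(12))
surviving = set(range(4))  # contiguous range revealed only at runtime

# Neighbouring partitions on different executors (round-robin-like):
spread = {p: p % NUM_EXECUTORS for p in partitions}
# A clustered assignment, as can happen with random processing order:
clustered = {p: p // 4 for p in partitions}

spread_load = Counter(spread[p] for p in surviving)
clustered_load = Counter(clustered[p] for p in surviving)
print("spread:   ", dict(spread_load))     # every executor gets work
print("clustered:", dict(clustered_load))  # one executor gets everything
```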

  was:
During scheduling Impala does the following:

* Non-Iceberg tables
** The scheduler processes the scan ranges in partition key order
** The scheduler selects N replicas as candidates
** The scheduler chooses the executor from the candidates based on minimum 
number of assigned bytes
** So consecutive partitions are more likely to be assigned to different 
executors
* Iceberg tables
** The scheduler processes the scan ranges in random order
** The scheduler selects N replicas as candidates
** The scheduler chooses the executor from the candidates based on minimum 
number of assigned bytes
** So consecutive partitions (by partition key order) are assigned randomly, 
i.e. there's a higher chance of clustering

If the IcebergScanNode ordered its file descriptors based on their paths we 
would have a more balanced scheduling for consecutive partitions. Queries that 
operate on a range of partitions are quite common, so it makes sense to 
optimize that case.


> Balance consecutive partitions better for Iceberg tables
> --------------------------------------------------------
>
>                 Key: IMPALA-12765
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12765
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg
>             Fix For: Impala 4.4.0
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
