[ 
https://issues.apache.org/jira/browse/IMPALA-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-14123:
-------------------------------------
    Description: 
Since IMPALA-11591 Impala only calls Iceberg's planFiles() when at least one 
predicate can be pushed down for a partition column.  
https://github.com/apache/impala/blob/006f9ba589cc7d104bc81927ebba647cafc0d39b/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java#L828

This is a useful optimization in general because if Iceberg can't skip many 
files based on stats then calling planFiles() makes planning slower without 
benefit. There are some cases though when many files could be skipped during 
planning and doing this would make the query much faster. In most cases Impala 
would still not read the whole data file, as Parquet min/max filtering would 
drop row groups based on the footer, but knowing this during planning would 
lead to a more efficient plan.

Some cases where planning time stat filtering could be useful:
1. sorting columns
2. column with some hidden relation to sorting or partition columns
3. cases when Impala's scanner can't do file level stat filtering - the obvious 
examples are Avro files, but Parquet/ORC min/max filtering also has limitations

some examples for 2 ("hidden partitioning/clustering"):
a. table is partitioned by country of a point, filter is on x/y coordinates
b. table is sorted by send_date, but filtered by receive_date - the two are 
likely to correlate
 
My idea is to provide a way to configure how Impala behaves (with the exception 
of explicit sorting columns, where this info is available):
1. create a query option that enables calling planFiles() in all cases - this 
is also useful to make Impala more consistent with other engines, for example 
when benchmarking running Impala with both values can give useful insights
2. add a table property that lists the columns where predicates should be 
pushed down to Iceberg - this could be also computed automatically by a tool 
that checks overlaps among file min/max ranges

  was:
Since IMPALA-11591 Impala only calls Iceberg's planFiles() when at least one 
predicate can be pushed down for a dictionary columns.  
https://github.com/apache/impala/blob/006f9ba589cc7d104bc81927ebba647cafc0d39b/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java#L828

This is a useful optimization in general because if Iceberg can't skip many 
files based on stats then calling planFiles() makes planning slower without 
benefit. There are some cases though when many files could be skipped during 
planning and doing this would make the query much faster. In most cases Impala 
would still not read the whole data file, as Parquet min/max filtering would 
drop row groups based on the footer, but knowing this during planning would 
lead to a more efficient plan.

Some cases where planning time stat filtering could be useful:
1. sorting columns
2. column with some hidden relation to sorting or partition columns
3. cases when Impala's scanner can't do file level stat filtering - the obvious 
examples are Avro files, but Parquet/ORC min/max filtering also has limitations

some examples for 2 ("hidden partitioning/clustering"):
a. table is partitioned by country of a point, filter is on x/y coordinates
b. table is sorted by send_date, but filtered by receive_date - the two are 
likely to correlate
 
My idea is to provide a way to configure how Impala behaves (with the exception 
of explicit sorting columns, where this info is available):
1. create a query option that enables calling planFiles() in all cases - this 
is also useful to make Impala more consistent with other engines, for example 
when benchmarking running Impala with both values can give useful insights
2. add a table property that lists the columns where predicates should be 
pushed down to Iceberg - this could be also computed automatically by a tool 
that checks overlaps among file min/max ranges


> Allow Iceberg predicate push down on non-partition columns
> ----------------------------------------------------------
>
>                 Key: IMPALA-14123
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14123
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>            Reporter: Csaba Ringhofer
>            Priority: Major
>              Labels: iceberg, impala-iceberg, performance
>
> Since IMPALA-11591 Impala only calls Iceberg's planFiles() when at least one 
> predicate can be pushed down for a partition column.  
> https://github.com/apache/impala/blob/006f9ba589cc7d104bc81927ebba647cafc0d39b/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java#L828
> This is a useful optimization in general because if Iceberg can't skip many 
> files based on stats then calling planFiles() makes planning slower without 
> benefit. There are some cases though when many files could be skipped during 
> planning and doing this would make the query much faster. In most cases 
> Impala would still not read the whole data file, as Parquet min/max filtering 
> would drop row groups based on the footer, but knowing this during planning 
> would lead to a more efficient plan.
> Some cases where planning time stat filtering could be useful:
> 1. sorting columns
> 2. column with some hidden relation to sorting or partition columns
> 3. cases when Impala's scanner can't do file level stat filtering - the 
> obvious examples are Avro files, but Parquet/ORC min/max filtering also has 
> limitations
> some examples for 2 ("hidden partitioning/clustering"):
> a. table is partitioned by country of a point, filter is on x/y coordinates
> b. table is sorted by send_date, but filtered by receive_date - the two are 
> likely to correlate
>  
> My idea is to provide a way to configure how Impala behaves (with the 
> exception of explicit sorting columns, where this info is available):
> 1. create a query option that enables calling planFiles() in all cases - this 
> is also useful to make Impala more consistent with other engines, for example 
> when benchmarking running Impala with both values can give useful insights
> 2. add a table property that lists the columns where predicates should be 
> pushed down to Iceberg - this could be also computed automatically by a tool 
> that checks overlaps among file min/max ranges



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to