[jira] [Created] (IMPALA-14123) Allor Iceberg predicate push down on non-partition columns

Csaba Ringhofer (Jira) Wed, 04 Jun 2025 02:00:24 -0700

Csaba Ringhofer created IMPALA-14123:
----------------------------------------


             Summary: Allor Iceberg predicate push down on non-partition columns
                 Key: IMPALA-14123
                 URL: https://issues.apache.org/jira/browse/IMPALA-14123
             Project: IMPALA
          Issue Type: Improvement
          Components: Frontend
            Reporter: Csaba Ringhofer


Since IMPALA-11591 Impala only calls Iceberg's planFiles() when at least one 
predicate can be pushed down for a dictionary columns.  
https://github.com/apache/impala/blob/006f9ba589cc7d104bc81927ebba647cafc0d39b/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java#L828

This is a useful optimization in general because if Iceberg can't skip many 
files based on stats then calling planFiles() makes planning slower without 
benefit. There are some cases though when many files could be skipped during 
planning and doing this would make the query much faster. In most cases Impala 
would still not read the whole data file, as Parquet min/max filtering would 
drop row groups based on the footer, but knowing this during planning would 
lead to a more efficient plan.

Some cases where planning time stat filtering could be useful:
1. sorting columns
2. column with some hidden relation to sorting or partition columns
3. cases when Impala's scanner can't do file level stat filtering - the obvious 
examples are Avro files, but Parquet/ORC min/max filtering also has limitations

some examples for 2 ("hidden partitioning/clustering"):
a. table is partitioned by country of a point, filter is on x/y coordinates
b. table is sorted by send_date, but filtered by receive_date - the two are 
likely to correlate
 
My idea is to provide a way to configure how Impala behaves (with the exception 
of explicit sorting columns, where this info is available):
1. create a query option that enables calling planFiles() in all cases - this 
is also useful to make Impala more consistent with other engines, for example 
when benchmarking running Impala with both values can give useful insights
2. add a table property that lists the columns where predicates should be 
pushed down to Iceberg - this could be also computed automatically by a tool 
that checks overlaps among file min/max ranges



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Created] (IMPALA-14123) Allor Iceberg predicate push down on non-partition columns

Reply via email to