Csaba Ringhofer created IMPALA-14123:
----------------------------------------
Summary: Allor Iceberg predicate push down on non-partition columns
Key: IMPALA-14123
URL: https://issues.apache.org/jira/browse/IMPALA-14123
Project: IMPALA
Issue Type: Improvement
Components: Frontend
Reporter: Csaba Ringhofer
Since IMPALA-11591 Impala only calls Iceberg's planFiles() when at least one
predicate can be pushed down for a dictionary columns.
https://github.com/apache/impala/blob/006f9ba589cc7d104bc81927ebba647cafc0d39b/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java#L828
This is a useful optimization in general because if Iceberg can't skip many
files based on stats then calling planFiles() makes planning slower without
benefit. There are some cases though when many files could be skipped during
planning and doing this would make the query much faster. In most cases Impala
would still not read the whole data file, as Parquet min/max filtering would
drop row groups based on the footer, but knowing this during planning would
lead to a more efficient plan.
Some cases where planning time stat filtering could be useful:
1. sorting columns
2. column with some hidden relation to sorting or partition columns
3. cases when Impala's scanner can't do file level stat filtering - the obvious
examples are Avro files, but Parquet/ORC min/max filtering also has limitations
some examples for 2 ("hidden partitioning/clustering"):
a. table is partitioned by country of a point, filter is on x/y coordinates
b. table is sorted by send_date, but filtered by receive_date - the two are
likely to correlate
My idea is to provide a way to configure how Impala behaves (with the exception
of explicit sorting columns, where this info is available):
1. create a query option that enables calling planFiles() in all cases - this
is also useful to make Impala more consistent with other engines, for example
when benchmarking running Impala with both values can give useful insights
2. add a table property that lists the columns where predicates should be
pushed down to Iceberg - this could be also computed automatically by a tool
that checks overlaps among file min/max ranges
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]