[
https://issues.apache.org/jira/browse/IMPALA-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Csaba Ringhofer updated IMPALA-14123:
-------------------------------------
Description:
Since IMPALA-11591 Impala only calls Iceberg's planFiles() when at least one
predicate can be pushed down for a partition column.
https://github.com/apache/impala/blob/006f9ba589cc7d104bc81927ebba647cafc0d39b/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java#L828
This is a useful optimization in general because if Iceberg can't skip many
files based on stats then calling planFiles() makes planning slower without
benefit. There are some cases though when many files could be skipped during
planning and doing this would make the query much faster. In most cases Impala
would still not read the whole data file, as Parquet min/max filtering would
drop row groups based on the footer, but knowing this during planning would
lead to a more efficient plan.
Some cases where planning time stat filtering could be useful:
1. sorting columns
2. column with some hidden relation to sorting or partition columns
3. cases when Impala's scanner can't do file level stat filtering - the obvious
examples are Avro files, but Parquet/ORC min/max filtering also has limitations
some examples for 2 ("hidden partitioning/clustering"):
a. table is partitioned by country of a point, filter is on x/y coordinates
b. table is sorted by send_date, but filtered by receive_date - the two are
likely to correlate
My idea is to provide a way to configure how Impala behaves (with the exception
of explicit sorting columns, where this info is available):
1. create a query option that enables calling planFiles() in all cases - this
is also useful to make Impala more consistent with other engines, for example
when benchmarking running Impala with both values can give useful insights
2. add a table property that lists the columns where predicates should be
pushed down to Iceberg - this could be also computed automatically by a tool
that checks overlaps among file min/max ranges
was:
Since IMPALA-11591 Impala only calls Iceberg's planFiles() when at least one
predicate can be pushed down for a dictionary columns.
https://github.com/apache/impala/blob/006f9ba589cc7d104bc81927ebba647cafc0d39b/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java#L828
This is a useful optimization in general because if Iceberg can't skip many
files based on stats then calling planFiles() makes planning slower without
benefit. There are some cases though when many files could be skipped during
planning and doing this would make the query much faster. In most cases Impala
would still not read the whole data file, as Parquet min/max filtering would
drop row groups based on the footer, but knowing this during planning would
lead to a more efficient plan.
Some cases where planning time stat filtering could be useful:
1. sorting columns
2. column with some hidden relation to sorting or partition columns
3. cases when Impala's scanner can't do file level stat filtering - the obvious
examples are Avro files, but Parquet/ORC min/max filtering also has limitations
some examples for 2 ("hidden partitioning/clustering"):
a. table is partitioned by country of a point, filter is on x/y coordinates
b. table is sorted by send_date, but filtered by receive_date - the two are
likely to correlate
My idea is to provide a way to configure how Impala behaves (with the exception
of explicit sorting columns, where this info is available):
1. create a query option that enables calling planFiles() in all cases - this
is also useful to make Impala more consistent with other engines, for example
when benchmarking running Impala with both values can give useful insights
2. add a table property that lists the columns where predicates should be
pushed down to Iceberg - this could be also computed automatically by a tool
that checks overlaps among file min/max ranges
> Allow Iceberg predicate push down on non-partition columns
> ----------------------------------------------------------
>
> Key: IMPALA-14123
> URL: https://issues.apache.org/jira/browse/IMPALA-14123
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Reporter: Csaba Ringhofer
> Priority: Major
> Labels: iceberg, impala-iceberg, performance
>
> Since IMPALA-11591 Impala only calls Iceberg's planFiles() when at least one
> predicate can be pushed down for a partition column.
> https://github.com/apache/impala/blob/006f9ba589cc7d104bc81927ebba647cafc0d39b/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java#L828
> This is a useful optimization in general because if Iceberg can't skip many
> files based on stats then calling planFiles() makes planning slower without
> benefit. There are some cases though when many files could be skipped during
> planning and doing this would make the query much faster. In most cases
> Impala would still not read the whole data file, as Parquet min/max filtering
> would drop row groups based on the footer, but knowing this during planning
> would lead to a more efficient plan.
> Some cases where planning time stat filtering could be useful:
> 1. sorting columns
> 2. column with some hidden relation to sorting or partition columns
> 3. cases when Impala's scanner can't do file level stat filtering - the
> obvious examples are Avro files, but Parquet/ORC min/max filtering also has
> limitations
> some examples for 2 ("hidden partitioning/clustering"):
> a. table is partitioned by country of a point, filter is on x/y coordinates
> b. table is sorted by send_date, but filtered by receive_date - the two are
> likely to correlate
>
> My idea is to provide a way to configure how Impala behaves (with the
> exception of explicit sorting columns, where this info is available):
> 1. create a query option that enables calling planFiles() in all cases - this
> is also useful to make Impala more consistent with other engines, for example
> when benchmarking running Impala with both values can give useful insights
> 2. add a table property that lists the columns where predicates should be
> pushed down to Iceberg - this could be also computed automatically by a tool
> that checks overlaps among file min/max ranges
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]