alamb commented on code in PR #12471:
URL: https://github.com/apache/datafusion/pull/12471#discussion_r1761635636
##########
datafusion/core/src/datasource/physical_plan/parquet/mod.rs:
##########
@@ -739,7 +739,17 @@ impl ExecutionPlan for ParquetExec {
}
fn statistics(&self) -> Result<Statistics> {
- Ok(self.projected_statistics.clone())
+ // When filters are pushed down, we have no way of knowing the exact
statistics.
+ // Note that pruning predicate is also a kind of filter pushdown.
+ let stats = if self.pruning_predicate.is_some()
+ || self.page_pruning_predicate.is_some()
Review Comment:
I think bloom filters should also belong in this list 🤔
##########
datafusion/physical-plan/src/execution_plan.rs:
##########
@@ -379,6 +379,9 @@ pub trait ExecutionPlan: Debug + DisplayAs + Send + Sync {
/// Returns statistics for this `ExecutionPlan` node. If statistics are not
/// available, should return [`Statistics::new_unknown`] (the default), not
/// an error.
+ ///
+ /// For TableScan executors, which supports filter pushdown, special
attention
Review Comment:
💯
##########
datafusion/core/src/datasource/listing/table.rs:
##########
@@ -817,6 +817,20 @@ impl TableProvider for ListingTable {
.map(|col| Ok(self.table_schema.field_with_name(&col.0)?.clone()))
.collect::<Result<Vec<_>>>()?;
+ // If the filters can be resolved using only partition cols, there is
no need to
+ // pushdown it to TableScan, otherwise, `unhandled` pruning predicates
will be generated
+ let table_partition_col_names = table_partition_cols
Review Comment:
This makes sense -- but I don't see a test for it. I also feel like this
logic might be duplicated elsewhere(maybe the "can push filters code")?
What would you think about updating the fields on `ListingTableProvider`
somehow to reflect this knowledge a bit more centrally?
##########
datafusion/core/src/datasource/physical_plan/parquet/mod.rs:
##########
@@ -739,7 +739,17 @@ impl ExecutionPlan for ParquetExec {
}
fn statistics(&self) -> Result<Statistics> {
- Ok(self.projected_statistics.clone())
+ // When filters are pushed down, we have no way of knowing the exact
statistics.
+ // Note that pruning predicate is also a kind of filter pushdown.
+ let stats = if self.pruning_predicate.is_some()
+ || self.page_pruning_predicate.is_some()
+ || (self.predicate.is_some() && self.pushdown_filters())
+ {
+ Statistics::new_unknown(&self.schema())
Review Comment:
Rather than discarding all statistics, what do you think about marking any
precisions as `Precision::Inexact`?
I am a bit concerned that we would simply discard the statistics entirely in
the (very common) case of filter pushdown 🤔
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]