alamb commented on code in PR #12471:
URL: https://github.com/apache/datafusion/pull/12471#discussion_r1769532838
##########
datafusion/core/src/datasource/physical_plan/parquet/mod.rs:
##########
@@ -741,7 +741,17 @@ impl ExecutionPlan for ParquetExec {
}
fn statistics(&self) -> Result<Statistics> {
- Ok(self.projected_statistics.clone())
+ // When filters are pushed down, we have no way of knowing the exact
statistics.
+ // Note that pruning predicate is also a kind of filter pushdown.
Review Comment:
```suggestion
// Note that pruning predicate is also a kind of filter pushdown.
// (bloom filters use `pruning_predicate` too)
```
##########
datafusion/core/src/datasource/physical_plan/parquet/mod.rs:
##########
@@ -739,7 +739,17 @@ impl ExecutionPlan for ParquetExec {
}
fn statistics(&self) -> Result<Statistics> {
- Ok(self.projected_statistics.clone())
+ // When filters are pushed down, we have no way of knowing the exact
statistics.
+ // Note that pruning predicate is also a kind of filter pushdown.
+ let stats = if self.pruning_predicate.is_some()
+ || self.page_pruning_predicate.is_some()
Review Comment:
Makes sense, maybe we can update the comments to help future readers. I left
a suggestion
##########
datafusion/core/src/datasource/listing/table.rs:
##########
@@ -817,6 +817,20 @@ impl TableProvider for ListingTable {
.map(|col| Ok(self.table_schema.field_with_name(&col.0)?.clone()))
.collect::<Result<Vec<_>>>()?;
+ // If the filters can be resolved using only partition cols, there is
no need to
+ // pushdown it to TableScan, otherwise, `unhandled` pruning predicates
will be generated
+ let table_partition_col_names = table_partition_cols
Review Comment:
I guess I was thinking of "if we removed this code by accident or in some
refactoring, would any test change"?
If the answer is no, that means we are relying on reviewers to avoid such a
regression, which given our available review capacity (never as much as we
would like!) is likely not very robust
I actually checked by removing this code and running the tests and indeed it
does fail as you say
```
failures:
---- sql::path_partition::parquet_statistics stdout ----
thread 'sql::path_partition::parquet_statistics' panicked at
datafusion/core/tests/sql/path_partition.rs:492:5:
assertion `left == right` failed
left: Inexact(1)
right: Exact(1)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
failures:
sql::path_partition::parquet_statistics
test result: FAILED. 245 passed; 1 failed; 0 ignored; 0 measured; 0 filtered
out; finished in 2.57s
error: test failed, to rerun pass `-p datafusion --test core_integration`
```
I personally think it would be clearer to have a separate test for this
pushdown behavior rather than relying on the statistics test, but the code is
covered
##########
datafusion/core/tests/parquet/file_statistics.rs:
##########
@@ -36,8 +37,33 @@ use datafusion_execution::config::SessionConfig;
use datafusion_execution::runtime_env::RuntimeEnvBuilder;
use datafusion::execution::session_state::SessionStateBuilder;
+use datafusion_expr::{BinaryExpr, Expr};
use tempfile::tempdir;
+#[tokio::test]
+async fn check_stats_precision_with_filter_pushdown() {
+ let testdata = datafusion::test_util::parquet_test_data();
+ let filename = format!("{}/{}", testdata, "alltypes_plain.parquet");
+ let table_path = ListingTableUrl::parse(filename).unwrap();
+
+ let opt = ListingOptions::new(Arc::new(ParquetFormat::default()));
+ let table = get_listing_table(&table_path, None, &opt).await;
+ let (_, _, state) = get_cache_runtime_state();
+ // Scan without filter, stats are exact
+ let exec = table.scan(&state, None, &[], None).await.unwrap();
+ assert_eq!(exec.statistics().unwrap().num_rows, Precision::Exact(8));
+
+ // Scan with filter pushdown, stats are inexact
Review Comment:
👍
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]