Re: [PR] return absent stats when filters are pushed down [datafusion]

via GitHub Sat, 21 Sep 2024 04:06:05 -0700


alamb commented on code in PR #12471:
URL: https://github.com/apache/datafusion/pull/12471#discussion_r1769532838



##########
datafusion/core/src/datasource/physical_plan/parquet/mod.rs:
##########
@@ -741,7 +741,17 @@ impl ExecutionPlan for ParquetExec {
     }
 
     fn statistics(&self) -> Result<Statistics> {
-        Ok(self.projected_statistics.clone())
+        // When filters are pushed down, we have no way of knowing the exact 
statistics.
+        // Note that pruning predicate is also a kind of filter pushdown.

Review Comment:
   ```suggestion
           // Note that pruning predicate is also a kind of filter pushdown.
           // (bloom filters use `pruning_predicate` too)
   ```



##########
datafusion/core/src/datasource/physical_plan/parquet/mod.rs:
##########
@@ -739,7 +739,17 @@ impl ExecutionPlan for ParquetExec {
     }
 
     fn statistics(&self) -> Result<Statistics> {
-        Ok(self.projected_statistics.clone())
+        // When filters are pushed down, we have no way of knowing the exact 
statistics.
+        // Note that pruning predicate is also a kind of filter pushdown.
+        let stats = if self.pruning_predicate.is_some()
+            || self.page_pruning_predicate.is_some()

Review Comment:
   Makes sense, maybe we can update the comments to help future readers. I left 
a suggestion



##########
datafusion/core/src/datasource/listing/table.rs:
##########
@@ -817,6 +817,20 @@ impl TableProvider for ListingTable {
             .map(|col| Ok(self.table_schema.field_with_name(&col.0)?.clone()))
             .collect::<Result<Vec<_>>>()?;
 
+        // If the filters can be resolved using only partition cols, there is 
no need to
+        // pushdown it to TableScan, otherwise, `unhandled` pruning predicates 
will be generated
+        let table_partition_col_names = table_partition_cols

Review Comment:
   I guess I was thinking of "if we removed this code by accident or in some 
refactoring, would any test change"? 
   
   If the answer is no, that means we are relying on reviewers to avoid such a 
regression, which given our available review capacity (never as much as we 
would like!) is likely not very robust
   
   I actually checked by removing this code and running the tests and indeed it 
does fail as you say
   
   ```
   failures:
   
   ---- sql::path_partition::parquet_statistics stdout ----
   thread 'sql::path_partition::parquet_statistics' panicked at 
datafusion/core/tests/sql/path_partition.rs:492:5:
   assertion `left == right` failed
     left: Inexact(1)
    right: Exact(1)
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   
   
   failures:
       sql::path_partition::parquet_statistics
   
   test result: FAILED. 245 passed; 1 failed; 0 ignored; 0 measured; 0 filtered 
out; finished in 2.57s
   
   error: test failed, to rerun pass `-p datafusion --test core_integration`
   ```
   
   I personally think it would be clearer to have a separate test for this 
pushdown behavior rather than relying on the statistics test, but the code is 
covered



##########
datafusion/core/tests/parquet/file_statistics.rs:
##########
@@ -36,8 +37,33 @@ use datafusion_execution::config::SessionConfig;
 use datafusion_execution::runtime_env::RuntimeEnvBuilder;
 
 use datafusion::execution::session_state::SessionStateBuilder;
+use datafusion_expr::{BinaryExpr, Expr};
 use tempfile::tempdir;
 
+#[tokio::test]
+async fn check_stats_precision_with_filter_pushdown() {
+    let testdata = datafusion::test_util::parquet_test_data();
+    let filename = format!("{}/{}", testdata, "alltypes_plain.parquet");
+    let table_path = ListingTableUrl::parse(filename).unwrap();
+
+    let opt = ListingOptions::new(Arc::new(ParquetFormat::default()));
+    let table = get_listing_table(&table_path, None, &opt).await;
+    let (_, _, state) = get_cache_runtime_state();
+    // Scan without filter, stats are exact
+    let exec = table.scan(&state, None, &[], None).await.unwrap();
+    assert_eq!(exec.statistics().unwrap().num_rows, Precision::Exact(8));
+
+    // Scan with filter pushdown, stats are inexact

Review Comment:
   👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] return absent stats when filters are pushed down [datafusion]

Reply via email to