nuno-faria commented on PR #17275:
URL: https://github.com/apache/datafusion/pull/17275#issuecomment-3266643038

   I found a potential performance regression with `parquet 56.1.0`: the reader now fetches additional data pages when a page holds fewer rows than the execution batch size. For example:
   
   ```rust
   use datafusion::error::Result;
   use datafusion::prelude::{ParquetReadOptions, SessionConfig, SessionContext};
   
   #[tokio::main]
   async fn main() -> Result<()> {
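       // A single partition keeps the scan metrics easy to compare.
       let config = SessionConfig::new().with_target_partitions(1);
       let ctx = SessionContext::new_with_config(config);
       // Evaluate the filter inside the Parquet reader.
       ctx.sql("set datafusion.execution.parquet.pushdown_filters = true")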
           .await?
           .collect()
           .await?;
   
       ctx.sql(
           "
           copy (
               select i as k
               from generate_series(1, 1000000) as t(i)
               order by k
           ) to 't.parquet'
           options (MAX_ROW_GROUP_SIZE 100000, DATA_PAGE_ROW_COUNT_LIMIT 1000, WRITE_BATCH_SIZE 1000, DICTIONARY_ENABLED FALSE);",
       )
       .await?
       .collect()
       .await?;
   
       ctx.register_parquet("t", "t.parquet", ParquetReadOptions::new())
           .await?;
   
       ctx.sql("explain analyze select k from t where k = 123456")
           .await?
           .show()
           .await?;
   
       Ok(())
   }
   ```
   
   With `parquet 56.0.0`:
   ```
   metrics=[..., bytes_scanned=1273, ...]
   
   # some debug info showing that a single page is retrieved
   total=1273
   ranges=[132974..134247]
   ```
   
   With `parquet 56.1.0`:
   ```
   metrics=[..., bytes_scanned=9929, ...]
   
   # some debug info showing that multiple pages are retrieved
   total=9929
   ranges=[125400..126482, 126482..127564, 127564..128646, 128646..129728, 129728..130810, 130810..131892, 131892..132974, 132974..134247, 134247..135329]
   ```
   
   I think this is a consequence of https://github.com/apache/arrow-rs/pull/7850, more specifically https://github.com/apache/arrow-rs/blame/0c7cb2ac3f3132216a08fd557f9b1edc7f90060f/parquet/src/arrow/arrow_reader/selection.rs#L445.
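
As a rough sanity check of that hypothesis, here is a minimal sketch of the arithmetic. It is not the arrow-rs code. It assumes the row selection is widened to the enclosing batch, that the batch size is 8192 rows (the default `datafusion.execution.batch_size`), and that every page holds exactly 1000 rows as configured in the example. The helper `pages_for_expanded_selection` is a hypothetical name used only for illustration.

```rust
use std::ops::RangeInclusive;

/// Page indexes (within one row group) that have to be read when a selection
/// containing `row` is widened to the enclosing batch of `batch_size` rows.
/// Hypothetical helper for illustration; not part of parquet or DataFusion.
fn pages_for_expanded_selection(
    row: usize,
    batch_size: usize,
    page_rows: usize,
    group_rows: usize,
) -> RangeInclusive<usize> {
    // Round the row down to the start of its batch.
    let batch_start = (row / batch_size) * batch_size;
    // Exclusive end of the batch, clamped to the row group.
    let batch_end = usize::min(batch_start + batch_size, group_rows);
    (batch_start / page_rows)..=((batch_end - 1) / page_rows)
}

fn main() {
    // k starts at 1, so k = 123456 is file row 123455,
    // i.e. row 23455 of the second 100000-row row group.
    let row = (123_456 - 1) % 100_000;
    let pages = pages_for_expanded_selection(row, 8192, 1000, 100_000);
    println!("matching row sits in page {}", row / 1000);
    println!("pages read: {:?} ({} pages)", pages, pages.clone().count());
}
```

Under those assumptions the matching row is in page 23, while the widened selection covers pages 16 through 24. That is nine pages with the matching one in eighth position, which lines up with the nine ranges reported for `parquet 56.1.0`.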


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

