nuno-faria commented on PR #17275: URL: https://github.com/apache/datafusion/pull/17275#issuecomment-3266643038
I found a potential performance regression with `parquet 56.1.0`. Now more data pages will be returned if their size is less than the execution batch size. For example: ```rust use datafusion::error::Result; use datafusion::prelude::{ParquetReadOptions, SessionConfig, SessionContext}; #[tokio::main] async fn main() -> Result<()> { let config = SessionConfig::new().with_target_partitions(1); let ctx = SessionContext::new_with_config(config); ctx.sql("set datafusion.execution.parquet.pushdown_filters = true") .await? .collect() .await?; ctx.sql( " copy ( select i as k from generate_series(1, 1000000) as t(i) order by k ) to 't.parquet' options (MAX_ROW_GROUP_SIZE 100000, DATA_PAGE_ROW_COUNT_LIMIT 1000, WRITE_BATCH_SIZE 1000, DICTIONARY_ENABLED FALSE);", ) .await? .collect() .await?; ctx.register_parquet("t", "t.parquet", ParquetReadOptions::new()) .await?; ctx.sql("explain analyze select k from t where k = 123456") .await? .show() .await?; Ok(()) } ``` With `parquet 56.0.0`: ``` metrics=[..., bytes_scanned=1273, ...] # some debug info showing that a single page is retrieved total=1273 ranges=[132974..134247] ``` With `parquet 56.1.0`: ``` metrics=[..., bytes_scanned=9929, ...] # some debug info showing that multiple pages are retrieved total=9929 ranges=[125400..126482, 126482..127564, 127564..128646, 128646..129728, 129728..130810, 130810..131892, 131892..132974, 132974..134247, 134247..135329] ``` I think this is a consequence of https://github.com/apache/arrow-rs/pull/7850, more specifically https://github.com/apache/arrow-rs/blame/0c7cb2ac3f3132216a08fd557f9b1edc7f90060f/parquet/src/arrow/arrow_reader/selection.rs#L445. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org