alamb commented on PR #16659: URL: https://github.com/apache/datafusion/pull/16659#issuecomment-3028490668
This is a somewhat subtle issue, so I will try to summarize:

In DataFusion 47 and earlier:
1. Calling `DataFrame::register_parquet` collected statistics for the table at create time (slower to create the table, potentially faster queries)
2. Calling `CREATE EXTERNAL TABLE` did not collect statistics (faster to create the table, but potentially slower queries)

There are more details about this on the ticket from @davisp here:
- https://github.com/apache/datafusion/issues/15908

In DataFusion 48.0.0:
1. https://github.com/apache/datafusion/pull/16080 changed the behavior so that neither `DataFrame::register_parquet` nor `CREATE EXTERNAL TABLE` collected statistics.

However, this means that users who were relying on statistics, such as @AdamGS, saw queries get slower (see https://github.com/apache/datafusion/issues/16444).

Thus this PR proposes changing `DataFusion 48.0.1` so that:
1. Both `DataFrame::register_parquet` and `CREATE EXTERNAL TABLE` **WILL** collect statistics.

Note that this is consistent with the behavior on the latest `main` (what will be released in DataFusion 49.0.0):
- https://github.com/apache/datafusion/issues/16158

Since we have already made this change on `main`, the thinking is that by changing 48.0.1 we'll avoid people fully migrating to the "no default statistics" behavior only to have to change back again in 49.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
For additional commands, e-mail: github-h...@datafusion.apache.org