alamb commented on PR #16659:
URL: https://github.com/apache/datafusion/pull/16659#issuecomment-3028490668

   This is a somewhat subtle issue so I will try and summarize:
   
   In DataFusion 47 and earlier,
   1. Calling `DataFrame::register_parquet` collected statistics for the table 
at create time (slower to create the table, but potentially faster queries)
   2. Calling `CREATE EXTERNAL TABLE` did not collect statistics (faster to 
create the table, but potentially slower queries)
   
   There are more details about this on the ticket from @davisp here:
    - https://github.com/apache/datafusion/issues/15908
   
   In DataFusion 48.0.0:
   1. After https://github.com/apache/datafusion/pull/16080, both 
`DataFrame::register_parquet` and `CREATE EXTERNAL TABLE` **DID NOT** collect 
statistics.
   
   However, this means that users who were relying on statistics, such as 
@AdamGS, saw queries get slower (see 
https://github.com/apache/datafusion/issues/16444).
   
   Thus this PR proposes changing `DataFusion 48.0.1` so that:
   1. Both `DataFrame::register_parquet` and `CREATE EXTERNAL TABLE` **WILL** 
collect statistics.
   
   Note that this is consistent with the behavior on the latest `main` (what 
will be released in DataFusion 49.0.0):
   -  https://github.com/apache/datafusion/issues/16158
   
   Since we have already made this change on `main`, the thinking is that by 
changing 48.0.1 we'll avoid people fully migrating to the "no default 
statistics" behavior only to have to change back again in 49.
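   
   For anyone who would rather not depend on the default in a given release, 
the behavior can also be pinned explicitly. A minimal sketch, assuming the 
`datafusion.execution.collect_statistics` option is set before the table is 
created (the table name `t` and the path are placeholders, not taken from 
this PR):
   
   ```sql
   -- Opt in to collecting statistics when the table is created
   SET datafusion.execution.collect_statistics = true;

   -- Hypothetical table and location, for illustration only
   CREATE EXTERNAL TABLE t
   STORED AS PARQUET
   LOCATION '/path/to/data.parquet';
   ```
   
   Setting the option to `false` instead should give the faster-to-create, 
no-statistics behavior described above.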
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

