davisp commented on issue #16158: URL: https://github.com/apache/datafusion/issues/16158#issuecomment-2908080182
Registering my official +1 to default to collecting statistics. For reference, I was working on the TPC-H benchmarks with a scale factor of 20 which generates roughly 20GiB of CSV data to process. Without statistics, query 17 or 18 would OOM on a 32GiB machine after about 20s. With statistics it never uses more than 1.9G and finishes in about 5s. I generally agree that statistics collection is probably a bit slower than not collecting, but my guess is that any folks that notice and/or care would be doing something like high frequency queries against tiny datasets which seems like a niche use case. However, I’d wager a minute amount of money that the benefits would be worth the cost around the 500MiB to 1GiB range (based on my assumption that statistics collection is just scanning row groups, not full table scans). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org