Re: [I] Default to collecting statistics when creating ListingTables [datafusion]

via GitHub Sun, 25 May 2025 14:04:49 -0700


davisp commented on issue #16158:
URL: https://github.com/apache/datafusion/issues/16158#issuecomment-2908080182


   Registering my official +1 to default to collecting statistics.
   
   For reference, I was working on the TPC-H benchmarks with a scale factor of 
20, which generates roughly 20GiB of CSV data to process. Without statistics, 
query 17 or 18 would OOM on a 32GiB machine after about 20s. With statistics, it 
never used more than 1.9G and finished in about 5s.
   
   I generally agree that collecting statistics is probably a bit slower than 
not collecting them, but my guess is that anyone who notices and/or cares would 
be running something like high-frequency queries against tiny datasets, which 
seems like a niche use case. I’d wager a minute amount of money that the 
benefits outweigh the cost once datasets reach the 500MiB to 1GiB range (based 
on my assumption that statistics collection only scans row groups and does not 
perform full table scans).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

