alamb commented on PR #20820: URL: https://github.com/apache/datafusion/pull/20820#issuecomment-4104417783
> One part that seems clear to me now based on some benchmarking is that we will want to split morsels into smaller ones if we're running out of them (i.e. no more files to morselize and number of morsels is < threads).

This is super prescient -- I think one issue is that each file has only a few large row groups.

For example, `benchmarks/data/hits_partitioned/hits_32.parquet` has 4 row groups, each with 100k rows:

<img width="1218" height="1051" alt="Screenshot 2026-03-21 at 5 08 54 PM" src="https://github.com/user-attachments/assets/c3734147-20e1-4bef-ac20-ad91cbf2442e" />

I am trying a slightly different strategy now
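
To make the splitting idea from the quoted text concrete, here is a minimal sketch, not the code in this PR. It assumes a hypothetical `Morsel` type that describes a contiguous row range inside one row group, and a hypothetical `split_until_enough` helper that halves the largest pending morsel until there are at least as many morsels as threads. The `min_rows` floor is also an assumption, added so that morsels do not become too small to be worth scheduling.

```rust
/// Hypothetical morsel: a contiguous range of rows inside one row group.
/// (Illustrative only; not a DataFusion type.)
#[derive(Debug, Clone, PartialEq, Eq)]
struct Morsel {
    row_group: usize,
    start_row: usize,
    num_rows: usize,
}

/// Halve the largest morsel repeatedly until there are at least `threads`
/// morsels, or until the largest morsel cannot be split without producing
/// a piece smaller than `min_rows`.
fn split_until_enough(mut morsels: Vec<Morsel>, threads: usize, min_rows: usize) -> Vec<Morsel> {
    // Guard against a zero floor, which would allow empty morsels.
    let min_rows = min_rows.max(1);

    while morsels.len() < threads {
        // Pick the largest pending morsel; stop if there is nothing to split.
        let Some((idx, _)) = morsels.iter().enumerate().max_by_key(|(_, m)| m.num_rows) else {
            break;
        };
        let largest = morsels[idx].clone();

        // Both halves must keep at least `min_rows` rows.
        if largest.num_rows < 2 * min_rows {
            break;
        }

        let first_half = largest.num_rows / 2;
        morsels[idx].num_rows = first_half;
        morsels.insert(
            idx + 1,
            Morsel {
                row_group: largest.row_group,
                start_row: largest.start_row + first_half,
                num_rows: largest.num_rows - first_half,
            },
        );
    }
    morsels
}
```

With the shape described above (4 row groups of 100k rows each) and, say, 16 threads, this sketch would end up with 16 morsels of 25k rows each. A real implementation would still need to turn each row range into an actual partial read of the row group, which this sketch does not attempt.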
