alamb commented on PR #20820:
URL: https://github.com/apache/datafusion/pull/20820#issuecomment-4104417783

   > One part that seems clear to me now based on some benchmarking is that we 
will want to split morsels into smaller ones if we're running out of them (i.e. 
no more files to morselize and number of morsels is < threads).
   
   
   This is super prescient -- I think one issue is that each file has only a 
few large row groups
   
   For example, `benchmarks/data/hits_partitioned/hits_32.parquet` has 4 row 
groups, each with 100k rows
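
To make the splitting idea from the quoted comment concrete, here is a minimal sketch in Rust. It is not DataFusion's implementation: the `Morsel` struct, the `split_until_enough` function, and the `min_rows` floor are all hypothetical names introduced for illustration, and the sketch assumes a morsel can be described as a contiguous row range inside a single row group. It repeatedly halves the largest morsel while there are fewer morsels than threads.

```rust
/// Hypothetical morsel: a contiguous row range inside one row group.
/// (Illustrative only; not a DataFusion type.)
#[derive(Debug, Clone, PartialEq)]
struct Morsel {
    row_group: usize,
    start_row: usize,
    num_rows: usize,
}

/// Halve the largest morsel until there are at least `threads` morsels,
/// or until the largest morsel is too small to split without dropping
/// below `min_rows` rows per half.
fn split_until_enough(mut morsels: Vec<Morsel>, threads: usize, min_rows: usize) -> Vec<Morsel> {
    while morsels.len() < threads {
        // Pick the morsel with the most rows; stop if there are none.
        let Some((idx, _)) = morsels
            .iter()
            .enumerate()
            .max_by_key(|(_, m)| m.num_rows)
        else {
            break;
        };

        let largest = morsels[idx].clone();
        // Stop if either half would be empty or smaller than `min_rows`.
        if largest.num_rows < 2 || largest.num_rows < 2 * min_rows {
            break;
        }

        // Keep the first half in place and push the second half as a new morsel.
        let left = largest.num_rows / 2;
        morsels[idx].num_rows = left;
        morsels.push(Morsel {
            row_group: largest.row_group,
            start_row: largest.start_row + left,
            num_rows: largest.num_rows - left,
        });
    }
    morsels
}
```

With the file above (4 row groups of 100k rows each) and an assumed 16 threads, starting from one morsel per row group, this would end up with 16 morsels of 25,000 rows each. A real reader would still have to turn each row range into a row selection within its row group, which this sketch leaves out.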
   
   <img width="1218" height="1051" alt="Screenshot 2026-03-21 at 5 08 54 PM" 
src="https://github.com/user-attachments/assets/c3734147-20e1-4bef-ac20-ad91cbf2442e"
 />
   
   I am trying a slightly different strategy now


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

