Kontinuation commented on PR #14644:
URL: https://github.com/apache/datafusion/pull/14644#issuecomment-2660748423

   > Thank you @kazuyukitanimura for the PR. I applied it and tried to fix the 
test, but the test above still fails for me; I am not sure if I am missing 
something.
   
   There are 2 problems:
   
   1. The `DataSourceExec` may have many partitions, and the `SortExec` for each 
partition gets only a fair share of the 100 MB pool, so no single partition has 
enough memory to operate. This is the case even with `worker_threads = 1`. If 
you still want to sort the 4.2 GB Parquet file with 100 MB of memory, you can 
set `.with_target_partitions(1)` in your session config.
   2. 100 MB is not enough for the final merge that reads the spill files back. 
Roughly 200 spill files are generated after ingesting all the batches, and a 
typical batch for this workload is 352,380 bytes. The memory needed for merging 
is about 200 * 352380 bytes * 2 > 100 MB. The merge phase is unspillable, so it 
requires a minimum amount of memory to operate. Raising the memory limit to 
200 MB works for this particular workload.
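   The arithmetic behind problem 2 can be checked with a quick sketch. The 
2x factor is taken from the estimate above (one in-flight batch per spill file, 
doubled for the merge's internal buffering); the exact accounting inside 
DataFusion may differ.

   ```rust
   // Back-of-the-envelope estimate of the unspillable memory needed by the
   // final spill-read merge, per the numbers in the comment above.
   fn merge_memory_estimate(spill_files: usize, batch_bytes: usize) -> usize {
       spill_files * batch_bytes * 2
   }

   fn main() {
       let needed = merge_memory_estimate(200, 352_380); // 140,952,000 bytes
       let limit_100mb = 100 * 1024 * 1024;
       let limit_200mb = 200 * 1024 * 1024;
       println!("needed: {needed} bytes");
       assert!(needed > limit_100mb); // why the 100 MB run fails
       assert!(needed < limit_200mb); // why 200 MB is enough here
   }
   ```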
   
   One possible fix for problem 2 is to use a smaller batch size when writing 
batches to spill files, so that the unspillable memory required for the final 
spill-read merge is smaller. Alternatively, we simply leave this problem as is 
and require the user to raise the memory limit.
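   For reference, a minimal sketch of a session set up with both workarounds 
(single partition plus a 200 MB pool). API names here assume a recent 
DataFusion release and are not taken from the PR itself; check them against 
your version. Not compiled as part of this comment.

   ```rust
   use datafusion::execution::runtime_env::RuntimeEnvBuilder;
   use datafusion::prelude::{SessionConfig, SessionContext};

   fn make_context() -> datafusion::error::Result<SessionContext> {
       // Workaround for problem 1: one partition, so the lone SortExec
       // gets the whole memory pool instead of a fair share.
       let config = SessionConfig::new().with_target_partitions(1);
       // Workaround for problem 2: a 200 MB pool covers the unspillable
       // spill-read merge for this particular workload.
       let runtime = RuntimeEnvBuilder::new()
           .with_memory_limit(200 * 1024 * 1024, 1.0)
           .build_arc()?;
       Ok(SessionContext::new_with_config_rt(config, runtime))
   }
   ```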


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

