Kontinuation commented on PR #14644: URL: https://github.com/apache/datafusion/pull/14644#issuecomment-2660748423
> Thank you @kazuyukitanimura for the PR. I applied the PR to try to fix the testing, but the test above still fails for me; I am not sure if I am missing something.

There are two problems:

1. The `DataSourceExec` may have many partitions, and each `SortExec` on a partition only gets a fair share of the 100 MB pool, so no single partition gets enough memory to operate. This is still the case even with `worker_threads = 1`. If you still want to sort a 4.2 GB Parquet file using 100 MB of memory, you can set `.with_target_partitions(1)` in your session config.
2. 100 MB is not enough for the final merge with spill-reads. Roughly 200 spill files are generated after ingesting all the batches, and a typical batch for this workload is 352380 bytes. The memory needed for merging is therefore 200 * (352380 bytes) * 2 > 100 MB. The merge phase is unspillable, so it requires a minimum amount of memory to operate. Raising the memory limit to 200 MB makes this particular workload work.

One possible fix for problem 2 is to use a smaller batch size when writing batches to spill files, so that the unspillable memory required for the final spill-read merge is smaller. Alternatively, we can simply leave this problem as is and require the user to raise the memory limit.
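For reference, the two workarounds above (a single target partition and a larger memory limit) could be wired into a session roughly like this. This is a sketch, not part of the PR; it assumes a recent DataFusion release where `RuntimeEnvBuilder` and `FairSpillPool` are available under these paths, and the exact builder names may differ between versions:

```rust
use std::sync::Arc;

use datafusion::error::Result;
use datafusion::execution::memory_pool::FairSpillPool;
use datafusion::execution::runtime_env::RuntimeEnvBuilder;
use datafusion::prelude::{SessionConfig, SessionContext};

fn make_context() -> Result<SessionContext> {
    // One partition, so the sort is not starved by a fair split of the pool.
    let config = SessionConfig::new().with_target_partitions(1);

    // 200 MB pool: enough headroom for the unspillable merge phase
    // in the workload discussed above.
    let runtime = RuntimeEnvBuilder::new()
        .with_memory_pool(Arc::new(FairSpillPool::new(200 * 1024 * 1024)))
        .build_arc()?;

    Ok(SessionContext::new_with_config_rt(config, runtime))
}
```

This is a configuration fragment only; the numbers come straight from the discussion above.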
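The back-of-the-envelope estimate for problem 2 can be checked with the numbers reported above (the factor of 2 is the rough per-batch overhead during the merge, as stated in the comment):

```rust
fn main() {
    // Figures reported for this workload.
    let spill_files: u64 = 200;          // spill files after ingesting all batches
    let batch_size_bytes: u64 = 352_380; // size of a typical batch

    // Each spill file keeps one batch in flight during the merge,
    // with a ~2x overhead factor, and the merge phase cannot spill.
    let merge_memory = spill_files * batch_size_bytes * 2;
    println!("merge needs ~{merge_memory} bytes");

    // Exceeds a 100 MB pool, but fits in a 200 MB pool.
    assert!(merge_memory > 100 * 1024 * 1024);
    assert!(merge_memory < 200 * 1024 * 1024);
}
```

The estimate comes to about 141 MB, which is why raising the limit to 200 MB is sufficient for this workload while 100 MB is not.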