xuchen-plus commented on issue #12136:
URL: https://github.com/apache/datafusion/issues/12136#issuecomment-2642135559

   > Not sure why the sorted batches' memory is over 2.6x than the batches 
before sort.
   
   Some findings so far:
   1. My test reads a parquet file with mostly string columns and sorts by 
one column. With string view enabled, `get_record_batch_memory_size` appears 
to produce overlarge values, since string view columns from multiple batches 
may share the same buffer, so the same buffer can be counted more than once.
   a) The memory counted at `insert_batch` may be larger than the actual 
physical memory usage:
   
https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/sorts/sort.rs#L301-L307
   b) The memory counted after `in_mem_sort` is also too large:
   
https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/sorts/sort.rs#L446-L450
   2. With string view disabled for the parquet reader and a larger value of 
`sort_spill_reservation_bytes`, my test finishes successfully, and disk 
spilling works correctly along the way. Some observations:
   a) `sort_spill_reservation_bytes` should be set large enough to hold:

         - Extra memory required while sorting each in-memory batch. The 
sorted batch may consume more memory than the original.
         - Extra memory for allocating `Rows` for `SortPreservingMergeStream`.
   
   b) The memory consumption in
   
https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/sorts/sort.rs#L439-L444
   
   is not counted against the memory limit. During `collect`, `BatchBuilder` 
`interleave`s the sorted batches to produce merged batches. At that point both 
the sorted batches and the interleaved batches are in memory, so memory 
consumption may double.
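
To illustrate finding 1, here is a minimal std-only sketch of how per-batch accounting can overcount when batches share a buffer. `ViewColumn`, `Batch`, `naive_size`, and `deduplicated_size` are hypothetical stand-ins made up for this example. They are not DataFusion or arrow types, and this is not how `get_record_batch_memory_size` is implemented.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical stand-in for a string view column: a list of
/// reference-counted data buffers that other columns may also reference.
pub struct ViewColumn {
    pub buffers: Vec<Arc<Vec<u8>>>,
}

/// Hypothetical stand-in for a record batch.
pub struct Batch {
    pub columns: Vec<ViewColumn>,
}

/// Per-batch accounting: every batch is charged for every buffer it
/// references, even if another batch references the same allocation.
pub fn naive_size(batches: &[Batch]) -> usize {
    batches
        .iter()
        .flat_map(|b| b.columns.iter())
        .flat_map(|c| c.buffers.iter())
        .map(|buf| buf.len())
        .sum()
}

/// Deduplicated accounting: each distinct allocation is charged once,
/// identified by the address the `Arc` points to.
pub fn deduplicated_size(batches: &[Batch]) -> usize {
    let mut seen = HashSet::new();
    batches
        .iter()
        .flat_map(|b| b.columns.iter())
        .flat_map(|c| c.buffers.iter())
        .filter(|buf| seen.insert(Arc::as_ptr(*buf) as usize))
        .map(|buf| buf.len())
        .sum()
}
```

With two batches that each hold one column pointing at the same 1024-byte buffer, `naive_size` reports 2048 bytes while `deduplicated_size` reports 1024.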

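The workaround in finding 2 could be configured roughly as below. This is a sketch that assumes the `datafusion` crate as a dependency. The 64 MiB value and the function name `workaround_config` are arbitrary placeholders, not values from my test. The config key for string view has been renamed across releases, so check the one your version uses.

```rust
use datafusion::prelude::{SessionConfig, SessionContext};

/// Hypothetical helper: builds a context with string view disabled for the
/// parquet reader and a larger sort spill reservation.
fn workaround_config() -> SessionContext {
    let config = SessionConfig::new()
        // Assumed key name; older releases used a different name for it.
        .set_bool("datafusion.execution.parquet.schema_force_view_types", false)
        // Placeholder value: it must cover the extra memory for sorting each
        // in-memory batch plus the `Rows` allocated during the merge.
        .with_sort_spill_reservation_bytes(64 * 1024 * 1024);
    SessionContext::new_with_config(config)
}
```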

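The doubling described in 2 b) can be sketched with a toy k-way merge that keeps its inputs alive while it builds the output. `merge_keeping_inputs` and `MergeStats` are hypothetical names. The sketch uses plain `Vec<i64>` runs rather than record batches and does not model `BatchBuilder` itself.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::mem::size_of;

/// Result of the toy merge, with byte counts for inputs and the observed peak.
pub struct MergeStats {
    pub merged: Vec<i64>,
    pub input_bytes: usize,
    pub peak_bytes: usize,
}

/// Merges already-sorted runs into one sorted output. The runs are only
/// borrowed, so they stay in memory for the whole merge, as the sorted
/// batches do while the merged batches are being produced.
pub fn merge_keeping_inputs(runs: &[Vec<i64>]) -> MergeStats {
    let input_bytes: usize = runs.iter().map(|r| r.len() * size_of::<i64>()).sum();

    // Min-heap of (value, run index, position within run).
    let mut heap = BinaryHeap::new();
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&first) = run.first() {
            heap.push(Reverse((first, run_idx, 0usize)));
        }
    }

    let mut merged = Vec::new();
    let mut peak_bytes = input_bytes;
    while let Some(Reverse((value, run_idx, pos))) = heap.pop() {
        merged.push(value);
        if let Some(&next) = runs[run_idx].get(pos + 1) {
            heap.push(Reverse((next, run_idx, pos + 1)));
        }
        // Inputs and the partially built output are live at the same time.
        peak_bytes = peak_bytes.max(input_bytes + merged.len() * size_of::<i64>());
    }

    MergeStats { merged, input_bytes, peak_bytes }
}
```

If only `input_bytes` is tracked against the limit, the peak in this model ends up at twice the tracked amount by the time the last row is emitted.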
-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

