xuchen-plus commented on issue #12136: URL: https://github.com/apache/datafusion/issues/12136#issuecomment-2642135559
> Not sure why the sorted batches' memory is over 2.6x than the batches before sort.

Some findings so far:

1. My test reads a Parquet file with mostly string columns and sorts by one column. It seems that enabling string view causes `get_record_batch_memory_size` to produce inflated values, since multiple string view columns from multiple batches may share the same buffer.
   a) The memory counted at `insert_batch` may be larger than the actual physical memory usage: https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/sorts/sort.rs#L301-L307
   b) The memory counted after `in_mem_sort` is also too large: https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/sorts/sort.rs#L446-L450
2. After I disabled string view for the Parquet reader and increased the value of `sort_spill_reservation_bytes`, my test finished successfully, and disk spilling worked correctly. Some observations:
   a) `sort_spill_reservation_bytes` should be set large enough to hold:
      - The extra memory required while sorting each in-memory batch. The sorted batch may consume more memory than the original.
      - The extra memory for allocating `Rows` for `SortPreservingMergeStream`.
   b) The memory consumed in https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/sorts/sort.rs#L439-L444 is not counted against the memory limit. During the `collect`, the sorted batches are `interleave`d by `BatchBuilder` to produce merged batches. At that point both the sorted batches and the interleaved batches exist in memory, so memory consumption may double.
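The over-counting described in finding 1 can be illustrated with a small standalone model. This is only a sketch using the Rust standard library, not DataFusion or Arrow code: `ViewColumn`, `naive_size`, `deduplicated_size`, and `shared_columns` are hypothetical names, and the 16-byte view size is an assumption about the string view layout. It shows how charging every column for every buffer it references inflates the total when buffers are shared, compared with counting each distinct buffer once.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Assumed size of one string view entry, in bytes.
const VIEW_SIZE: usize = 16;

/// Hypothetical stand-in for a string view column: one fixed-size view per
/// row, plus reference-counted data buffers that other columns may share.
struct ViewColumn {
    /// Number of views (one per row).
    views: usize,
    /// Data buffers holding the string bytes.
    buffers: Vec<Arc<Vec<u8>>>,
}

/// Naive accounting: every column is charged for every buffer it references,
/// so a buffer shared by N columns is counted N times.
fn naive_size(columns: &[ViewColumn]) -> usize {
    columns
        .iter()
        .map(|c| c.views * VIEW_SIZE + c.buffers.iter().map(|b| b.len()).sum::<usize>())
        .sum()
}

/// Deduplicated accounting: each distinct buffer is counted once, identified
/// by the address of its shared allocation.
fn deduplicated_size(columns: &[ViewColumn]) -> usize {
    let mut seen: HashSet<*const Vec<u8>> = HashSet::new();
    let mut total = 0;
    for c in columns {
        total += c.views * VIEW_SIZE;
        for b in &c.buffers {
            // `insert` returns true only the first time this buffer is seen.
            if seen.insert(Arc::as_ptr(b)) {
                total += b.len();
            }
        }
    }
    total
}

/// Builds `n` columns of `rows` rows each, all referencing one shared buffer.
fn shared_columns(n: usize, rows: usize, buffer_len: usize) -> Vec<ViewColumn> {
    let shared = Arc::new(vec![0u8; buffer_len]);
    (0..n)
        .map(|_| ViewColumn {
            views: rows,
            buffers: vec![Arc::clone(&shared)],
        })
        .collect()
}
```

In this model, four columns of 100 rows sharing a single 1,000,000-byte buffer are reported as 4,006,400 bytes by `naive_size` but as 1,006,400 bytes by `deduplicated_size`. Whether a pointer-based deduplication like this is the right fix for the actual sort operator is a separate question that the sketch does not answer.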