Kontinuation commented on PR #14644: URL: https://github.com/apache/datafusion/pull/14644#issuecomment-2658925999
> > 2. `shrink_to_fit` every sorted batches reduce the memory footprint of sorted batches, otherwise sorted string arrays may take 2X the original space in the worst case, due to exponential growth of `MutableBuffer` for storing variable length binary values. `shrink_to_fit` is a no-op for primitive-type columns returned by `take_arrays` since they already have the right capacity, and benchmarking showed no significant performance regression for non-primitive types such as string arrays, so I think it is a good change. This resolves "the first problem" reported by [Memory account not adding up in SortExec #10073](https://github.com/apache/datafusion/issues/10073).
>
> I think the buffer resizing mechanism is not doubling each time, the default policy will allocate new constant size buffers https://docs.rs/arrow-array/54.1.0/src/arrow_array/builder/generic_bytes_view_builder.rs.html#120-122, so this change might not help

Actually, it does help. I have added a new test case, `test_sort_spill_utf8_strings`, which fails if the `shrink_to_fit` calls are removed.

Here is where the 2X buffer growth comes from:

1. [`sort_batch` calls `take_arrays`](https://github.com/apache/datafusion/blob/45.0.0/datafusion/physical-plan/src/sorts/sort.rs#L646), which calls `take_bytes` for string columns
2. `take_bytes` [allocates a `MutableBuffer`](https://github.com/apache/arrow-rs/blob/54.1.0/arrow-select/src/take.rs#L473) for storing the strings taken from the input array
3. `take_bytes` [calls the `extend_from_slice` method](https://github.com/apache/arrow-rs/blob/54.1.0/arrow-select/src/take.rs#L479) of the values mutable buffer to append strings to it, which in turn [calls `reserve`](https://github.com/apache/arrow-rs/blob/54.1.0/arrow-buffer/src/buffer/mutable.rs#L375) to grow its space
4. `reserve` [grows the capacity exponentially](https://github.com/apache/arrow-rs/blob/54.1.0/arrow-buffer/src/buffer/mutable.rs#L195) by a factor of 2
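
To make the worst case concrete, the growth pattern in the steps above can be modeled with a few lines of plain Rust. This is only a sketch of a doubling policy, not the arrow-rs implementation: `round_up_64`, `grown_capacity`, and `simulate_appends` are hypothetical names, and the 64-byte rounding and the empty starting buffer are assumptions made for illustration.

```rust
/// Hypothetical helper: round `n` up to the next multiple of 64 bytes.
/// The 64-byte granularity is an assumption of this model.
fn round_up_64(n: usize) -> usize {
    (n + 63) / 64 * 64
}

/// Simplified model of a doubling growth policy: if `required` bytes do not
/// fit, the new capacity is the larger of the rounded-up requirement and
/// twice the current capacity.
fn grown_capacity(capacity: usize, required: usize) -> usize {
    if required <= capacity {
        capacity
    } else {
        round_up_64(required).max(capacity * 2)
    }
}

/// Append values of the given byte lengths one at a time, the way a take
/// kernel copies strings into a values buffer, and return `(len, capacity)`.
fn simulate_appends(initial_capacity: usize, value_lens: &[usize]) -> (usize, usize) {
    let mut len = 0;
    let mut capacity = initial_capacity;
    for &n in value_lens {
        len += n;
        capacity = grown_capacity(capacity, len);
    }
    (len, capacity)
}

fn main() {
    // 1025 strings of 64 bytes each, starting from an empty buffer.
    // The last append crosses the 65536-byte boundary and doubles the buffer.
    let lens = vec![64usize; 1025];
    let (len, capacity) = simulate_appends(0, &lens);
    // prints "used 65600 bytes, allocated 131072 bytes"
    println!("used {len} bytes, allocated {capacity} bytes");
}
```

In this model the buffer ends up holding 65600 bytes in a 131072-byte allocation, which is close to the 2X worst case. Trimming the capacity back down to the used length after sorting is what removes that slack.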