Kontinuation commented on PR #14644:
URL: https://github.com/apache/datafusion/pull/14644#issuecomment-2658925999

   > > 2. Calling `shrink_to_fit` on every sorted batch reduces the memory footprint of 
sorted batches, otherwise sorted string arrays may take 2X the original space 
in the worst case, due to exponential growth of `MutableBuffer` for storing 
variable length binary values. `shrink_to_fit` is a no-op for primitive-type 
columns returned by `take_arrays` since they already have the right capacity, 
and benchmarking showed no significant performance regression for non-primitive 
types such as string arrays, so I think it is a good change. This resolves "the 
first problem" reported by [Memory account not adding up in SortExec 
#10073](https://github.com/apache/datafusion/issues/10073).
   > 
   > I think the buffer resizing mechanism does not double each time; the 
default policy allocates new constant-size buffers 
(https://docs.rs/arrow-array/54.1.0/src/arrow_array/builder/generic_bytes_view_builder.rs.html#120-122),
 so this change might not help.
   
   Actually it helps. I have added a new test case 
`test_sort_spill_utf8_strings`. It will fail after removing the `shrink_to_fit` 
calls.
   
   Here is where the 2X buffer growth comes from:
   
   1. [sort_batch calls 
`take_arrays`](https://github.com/apache/datafusion/blob/45.0.0/datafusion/physical-plan/src/sorts/sort.rs#L646),
 which calls `take_bytes` for string columns
   2. `take_bytes` [allocates a 
`MutableBuffer`](https://github.com/apache/arrow-rs/blob/54.1.0/arrow-select/src/take.rs#L473)
 for storing strings taken from the input array
   3. `take_bytes` [calls the `extend_from_slice` 
method](https://github.com/apache/arrow-rs/blob/54.1.0/arrow-select/src/take.rs#L479)
 of the values mutable buffer to append strings to the buffer, which in turn 
[calls 
`reserve`](https://github.com/apache/arrow-rs/blob/54.1.0/arrow-buffer/src/buffer/mutable.rs#L375)
 to grow its space
   4. `reserve` [grows the size 
exponentially](https://github.com/apache/arrow-rs/blob/54.1.0/arrow-buffer/src/buffer/mutable.rs#L195)
 by a factor of 2
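
   To make the effect of steps 3 and 4 concrete, here is a rough standalone model of the growth policy. It does not call arrow-rs. The helper names (`round_up_to_64`, `grow`, `simulate_appends`, `shrunk_capacity`) are made up for illustration, and the 64-byte rounding and the behavior of `shrink_to_fit` are my assumptions about `MutableBuffer`, not something verified by this snippet.

```rust
/// Round `n` up to the next multiple of 64.
/// Hypothetical helper: assumes capacities are rounded to 64 bytes.
fn round_up_to_64(n: usize) -> usize {
    (n + 63) / 64 * 64
}

/// Model of the growth policy from step 4: if `required` bytes do not fit,
/// the new capacity is the larger of the rounded-up requirement and twice
/// the current capacity. Otherwise the capacity is unchanged.
fn grow(capacity: usize, required: usize) -> usize {
    if required <= capacity {
        capacity
    } else {
        std::cmp::max(round_up_to_64(required), capacity * 2)
    }
}

/// Append `count` values of `value_len` bytes each, one at a time (as in
/// step 3), and return `(len, capacity)` of the simulated values buffer.
fn simulate_appends(initial_capacity: usize, value_len: usize, count: usize) -> (usize, usize) {
    let mut len = 0;
    let mut capacity = initial_capacity;
    for _ in 0..count {
        capacity = grow(capacity, len + value_len);
        len += value_len;
    }
    (len, capacity)
}

/// Assumed effect of `shrink_to_fit`: capacity drops to the length,
/// rounded up to a multiple of 64.
fn shrunk_capacity(len: usize) -> usize {
    round_up_to_64(len)
}
```

   Under this model, `simulate_appends(0, 10, 103)` gives a length of 1030 bytes with a capacity of 2048 bytes, because the last append crosses the 1024-byte boundary and doubles the buffer. `shrunk_capacity(1030)` is 1088, which is roughly the "2X in the worst case" gap that `shrink_to_fit` reclaims.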


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

