ding-young commented on code in PR #17315: URL: https://github.com/apache/datafusion/pull/17315#discussion_r2306363399
########## datafusion/physical-plan/src/spill/spill_manager.rs: ########## @@ -194,7 +195,84 @@ impl GetSlicedSize for RecordBatch { for array in self.columns() { let data = array.to_data(); total += data.get_slice_memory_size()?; + + // While StringViewArray holds large data buffer for non inlined string, the Arrow layout (BufferSpec) + // does not include any data buffers. Currently, ArrayData::get_slice_memory_size() + // under-counts memory size by accounting only views buffer although data buffer is cloned during slice() + // + // Therefore, we manually add the sum of the lengths used by all non inlined views + // on top of the sliced size for views buffer. This matches the intended semantics of + // "bytes needed if we materialized exactly this slice into fresh buffers". + // Note: if multiple arrays share the same data buffers, we may double count each StringViewArray. Review Comment: @ctsk Just to clarify, are you suggesting that we should revise the memory size calculation to be buffer-sharing aware (inside a StringViewArray)? Or was the intention simply to clarify in comments that such sharing may occur within a single array? If it’s the former, I’ll revise the logic accordingly and add a test to cover that case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org