alamb commented on code in PR #11587:
URL: https://github.com/apache/datafusion/pull/11587#discussion_r1687100275
##########
datafusion/physical-plan/src/coalesce_batches.rs:
##########
@@ -290,6 +294,46 @@ pub fn concat_batches(
arrow::compute::concat_batches(schema, batches)
}
+/// `StringViewArray` reference to the raw parquet decoded buffer, which
reduces copy but prevents those buffer from being released.
+/// When `StringViewArray`'s cardinality significantly drops (e.g., after
`FilterExec` or `HashJoinExec` or many others),
+/// we should consider consolidating it so that we can release the buffer to
reduce memory usage and improve string locality for better performance.
+fn gc_string_view_batch(batch: &RecordBatch) -> RecordBatch {
+ let new_columns: Vec<ArrayRef> = batch
+ .columns()
+ .iter()
+ .map(|c| {
+ // Try to re-create the `StringViewArray` to prevent holding the
underlying buffer too long.
+ if let Some(s) = c.as_string_view_opt() {
Review Comment:
A minor comment here is that I think you can reduce the level of nesting by
using syntax like
```rust
let Some(s) = c.as_string_view_opt() else {
return Arc::clone(c)
}
```
##########
datafusion/physical-plan/src/coalesce_batches.rs:
##########
@@ -216,6 +218,8 @@ impl CoalesceBatchesStream {
match input_batch {
Poll::Ready(x) => match x {
Some(Ok(batch)) => {
+ let batch = gc_string_view_batch(&batch);
Review Comment:
This is an excellent point
I think given how concat is implemented for `StringView` it will only copy
the fixed parts (not the actual string data)
Perhaps what we could do is implement a wrapper around arrow::concat_batches
that has the datafusion specific GC trigger for sparse arrays, and falls back
to concat for other types:
https://docs.rs/arrow-select/52.1.0/src/arrow_select/concat.rs.html#150
```rust
/// wrapper around [`arrow::compute::concat`] that
pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef, ArrowError> {
// loop over columns here and handle StringView specially,
// or fallback to concat
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]