ctsk commented on issue #16206:
URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2988799373
I've re-encountered this issue, and it (unsurprisingly) gets amplified at larger scale factors (>100). For example, the top 3 items by cycles when running TPC-H query 18 at SF 300 are:
```
18.07%  tokio-runtime-w  tpch  [.] arrow_select::take::take_byte_view
14.52%  tokio-runtime-w  tpch  [.] <datafusion_physical_plan::coalesce_batches::CoalesceBatchesStream as futures_core::stream::Stream>::poll_next
12.96%  tokio-runtime-w  tpch  [.] core::ptr::drop_in_place<alloc::vec::Vec<arrow_buffer::buffer::immutable::Buffer>>
```
That is close to 50% of cycles spent due to this issue. Because the batches produced by the hash join carry a lot of data buffers, the CoalesceBatchesExec gets more expensive as well: inside the operator, we iterate over every view column's data buffers to decide whether to perform a garbage collection.
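
For intuition, here is a rough sketch of that kind of check. The helper `looks_garbage_heavy` and its 50% threshold are made up for illustration and are not DataFusion's actual heuristic; the point is that the scan touches every data buffer, so its cost grows with buffer count:

```rust
use arrow_array::StringViewArray;

/// Estimate whether a string-view column carries mostly-dead buffer bytes.
/// Hypothetical heuristic for illustration only.
fn looks_garbage_heavy(arr: &StringViewArray) -> bool {
    // Total capacity of all variable-length data buffers behind the views.
    let buffered: usize = arr.data_buffers().iter().map(|b| b.len()).sum();
    // Bytes actually referenced by the views; values of <= 12 bytes are
    // inlined into the view struct and use no buffer space.
    let referenced: usize = arr
        .iter()
        .flatten()
        .map(|s| if s.len() > 12 { s.len() } else { 0 })
        .sum();
    // Collect garbage once less than half of the buffered bytes are live.
    buffered > 0 && referenced * 2 < buffered
}
```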
I believe a reasonable fix for this issue is to perform a garbage collection on the build side after the concatenation. The build side would then hold only a single data buffer per string-view / byte-view column.
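
A minimal sketch of that compaction step, assuming arrow-rs's `GenericByteViewArray::gc`; the helper `gc_view_columns` is hypothetical and not the actual DataFusion change:

```rust
use std::sync::Arc;

use arrow_array::{Array, ArrayRef, BinaryViewArray, RecordBatch, StringViewArray};
use arrow_schema::{ArrowError, DataType};

/// Compact every string-view / byte-view column of a batch so that each
/// holds freshly packed data buffers instead of references into the many
/// buffers accumulated during the build-side concatenation.
fn gc_view_columns(batch: &RecordBatch) -> Result<RecordBatch, ArrowError> {
    let columns: Vec<ArrayRef> = batch
        .columns()
        .iter()
        .map(|col| match col.data_type() {
            DataType::Utf8View => {
                // `gc` copies the referenced bytes into compact buffers.
                let arr = col.as_any().downcast_ref::<StringViewArray>().unwrap();
                Arc::new(arr.gc()) as ArrayRef
            }
            DataType::BinaryView => {
                let arr = col.as_any().downcast_ref::<BinaryViewArray>().unwrap();
                Arc::new(arr.gc()) as ArrayRef
            }
            _ => Arc::clone(col),
        })
        .collect();
    RecordBatch::try_new(batch.schema(), columns)
}
```

Calling this once on the concatenated build-side batch would trade one extra copy of the live bytes for far cheaper probes, takes, and drops downstream.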