ctsk commented on issue #16206:
URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2988799373

   I've run into this issue again, and it is (unsurprisingly) amplified at 
larger scale factors (>100). For example, the top 3 items by cycles when 
running TPC-H query 18 at SF 300 are:
   
   ```
     18.07%  tokio-runtime-w  tpch  [.] arrow_select::take::take_byte_view
     14.52%  tokio-runtime-w  tpch  [.] <datafusion_physical_plan::coalesce_batches::CoalesceBatchesStream as futures_core::stream::Stream>::poll_next
     12.96%  tokio-runtime-w  tpch  [.] core::ptr::drop_in_place<alloc::vec::Vec<arrow_buffer::buffer::immutable::Buffer>>
   ``` 
   
   Together these three entries account for roughly 45% of cycles, all spent 
because of this issue. Because the batches produced by the hash join carry a 
lot of data buffers, the CoalesceBatchesExec gets more expensive too: inside 
the operator, we iterate over the data buffers to decide whether to perform a 
garbage collection.
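
   To illustrate why that check gets expensive, here is a rough sketch of such 
a decision. It is not the actual DataFusion code: `should_gc`, the 2x 
threshold, and the plain slices standing in for views and buffers are all 
assumptions made for the example. The only Arrow fact it relies on is that 
values of at most 12 bytes are stored inline in the view.

```rust
/// Values of at most 12 bytes live inline in the view and need no data buffer.
const INLINE_LEN: usize = 12;

/// Hypothetical heuristic (not DataFusion's implementation): compact when the
/// data buffers hold more than twice the bytes the views actually reference.
fn should_gc(view_lens: &[usize], buffer_sizes: &[usize]) -> bool {
    // Bytes that really have to live in data buffers.
    let ideal: usize = view_lens.iter().filter(|&&len| len > INLINE_LEN).sum();
    // Walking every buffer is the per-batch cost that grows with buffer count.
    let actual: usize = buffer_sizes.iter().sum();
    actual > ideal * 2
}

fn main() {
    // Two long values (50 bytes in total) backed by far more buffer space.
    assert!(should_gc(&[20, 30, 5], &[1000, 1000]));
    // The same values backed by one tight buffer: nothing to reclaim.
    assert!(!should_gc(&[20, 30, 5], &[60]));
    println!("ok");
}
```

   The sum over `buffer_sizes` runs once per batch, so a batch that drags along 
thousands of buffers pays for them here even when no compaction happens.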
   
   I believe a reasonable fix for this issue is to perform a garbage 
collection on the build side after the concatenation. The build side would then 
hold only a single data buffer per string-view / byte-view column.
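
   As a sketch of what that compaction does, here is a toy model of a view 
column. The `View` and `ViewColumn` types and the `compact` method are made up 
for this example and do not mirror arrow-rs's real layout (there are no inline 
values or prefixes here). In arrow-rs itself I would expect the `gc` method on 
the view arrays to do this job.

```rust
/// Toy view: which buffer the value lives in, where it starts, and its length.
#[derive(Debug, Clone, PartialEq)]
struct View {
    buffer: usize,
    offset: usize,
    len: usize,
}

/// Toy view column: views plus the data buffers they point into.
struct ViewColumn {
    views: Vec<View>,
    buffers: Vec<Vec<u8>>,
}

impl ViewColumn {
    /// Bytes of the i-th value.
    fn value(&self, i: usize) -> &[u8] {
        let v = &self.views[i];
        &self.buffers[v.buffer][v.offset..v.offset + v.len]
    }

    /// Copy every referenced byte range into one fresh buffer and rewrite
    /// the views so they all point into that buffer.
    fn compact(&self) -> ViewColumn {
        let total: usize = self.views.iter().map(|v| v.len).sum();
        let mut data = Vec::with_capacity(total);
        let mut views = Vec::with_capacity(self.views.len());
        for i in 0..self.views.len() {
            let offset = data.len();
            data.extend_from_slice(self.value(i));
            views.push(View { buffer: 0, offset, len: self.views[i].len });
        }
        ViewColumn { views, buffers: vec![data] }
    }
}

fn main() {
    // After concatenating three batches, their buffers are carried over as-is.
    let col = ViewColumn {
        views: vec![
            View { buffer: 0, offset: 0, len: 5 },
            View { buffer: 2, offset: 3, len: 4 },
            View { buffer: 1, offset: 0, len: 3 },
        ],
        buffers: vec![b"hello-unused".to_vec(), b"foo".to_vec(), b"xxxdata".to_vec()],
    };
    let compacted = col.compact();
    assert_eq!(compacted.buffers.len(), 1);
    for i in 0..col.views.len() {
        assert_eq!(col.value(i), compacted.value(i));
    }
    println!("{} buffers -> {}", col.buffers.len(), compacted.buffers.len());
}
```

   With a single buffer per column on the build side, every batch taken from it 
references at most one data buffer, so the downstream buffer walk and the 
`Vec<Buffer>` drops stay cheap.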


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

