alamb commented on issue #19414:
URL: https://github.com/apache/datafusion/issues/19414#issuecomment-3676909238

   > Are there scenarios/flows where we GC and spill ? I haven't tested 
external sort for example
   > Is there a way to figure out that the incoming array is already GC'd ?
   
   The operations that do take/filter already GC internally. Specifically, 
the coalescer in arrow has a bunch of heuristics about when to GC a string 
view: 
https://docs.rs/arrow/latest/arrow/compute/struct.BatchCoalescer.html 
   
   Specifically I think this is the heuristic: 
https://github.com/apache/arrow-rs/blob/240cbf4f838387445b0209db4b14dbb277b05a12/arrow-select/src/coalesce/byte_view.rs#L286-L295
   
   > Also , does GC'ing always before spill makes sense ? are there scenarios 
where GC-ing is inefficient.
   
   GCing will copy *all* the referenced strings, so if there isn't a large 
amount of "dead" data it may be faster to just write it all and read it all 
back. 
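   
   To make the trade-off concrete, here is a minimal sketch using a toy model 
of a view array. It is not the arrow-rs API: `ToyViewArray`, `View`, and 
`from_strs` are hypothetical names, and the model ignores that real string 
views inline short strings. It shows that a slice keeps the whole buffer 
alive, while GC copies only the referenced bytes.

```rust
// Toy model of a string view array. NOT the arrow-rs API: all names here
// are hypothetical and exist only to illustrate the idea.
#[derive(Clone, Copy, Debug, PartialEq)]
struct View {
    offset: usize, // start of the string inside `buffer`
    len: usize,    // length of the string in bytes
}

struct ToyViewArray {
    buffer: Vec<u8>,  // shared data buffer (an Arc'd buffer in a real array)
    views: Vec<View>, // one view per string
}

/// Build an array whose buffer holds all the strings back to back.
fn from_strs(values: &[&str]) -> ToyViewArray {
    let mut buffer = Vec::new();
    let mut views = Vec::new();
    for s in values {
        views.push(View { offset: buffer.len(), len: s.len() });
        buffer.extend_from_slice(s.as_bytes());
    }
    ToyViewArray { buffer, views }
}

impl ToyViewArray {
    /// Bytes actually referenced by the views (the "live" data).
    fn used_bytes(&self) -> usize {
        self.views.iter().map(|v| v.len).sum()
    }

    /// Slice: narrows the views but keeps the entire buffer.
    /// The clone stands in for sharing the same buffer by reference.
    fn slice(&self, start: usize, len: usize) -> ToyViewArray {
        ToyViewArray {
            buffer: self.buffer.clone(),
            views: self.views[start..start + len].to_vec(),
        }
    }

    /// GC: copy every referenced string into a fresh, compact buffer.
    /// The cost is proportional to `used_bytes()`, even if nothing is dead.
    fn gc(&self) -> ToyViewArray {
        let mut buffer = Vec::with_capacity(self.used_bytes());
        let mut views = Vec::with_capacity(self.views.len());
        for v in &self.views {
            let offset = buffer.len();
            buffer.extend_from_slice(&self.buffer[v.offset..v.offset + v.len]);
            views.push(View { offset, len: v.len });
        }
        ToyViewArray { buffer, views }
    }

    /// Read back the i-th string.
    fn value(&self, i: usize) -> &str {
        let v = self.views[i];
        std::str::from_utf8(&self.buffer[v.offset..v.offset + v.len]).unwrap()
    }
}
```

   In this model, writing a slice without GC writes `buffer.len()` bytes, 
while writing it after `gc()` writes only `used_bytes()` bytes, at the price 
of one extra copy of the live data.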
   
   When writing to a spill file, I do think it would make sense to GC all the 
strings if the load factor is small, that is, if the views reference only a 
small fraction of the buffer bytes (see heuristic above). 
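   
   As a rough sketch, the spill-time decision could look like the function 
below. `should_gc_before_spill` is a hypothetical helper, not an existing 
DataFusion or arrow-rs function, and the factor of 2 is an assumption chosen 
for illustration; the linked arrow-rs code is the reference for the real 
heuristic.

```rust
/// Hypothetical helper: decide whether to compact a string view array
/// before writing it to a spill file.
///
/// `used_bytes`   - bytes actually referenced by the views
/// `buffer_bytes` - total bytes held by the array's data buffers
fn should_gc_before_spill(used_bytes: usize, buffer_bytes: usize) -> bool {
    // Nothing is held in buffers, so there is nothing to compact.
    if buffer_bytes == 0 {
        return false;
    }
    // ASSUMPTION: treat the array as sparse when the buffers are more than
    // twice the size of the live data. GC copies all live strings, so it is
    // only worth doing when it saves a meaningful amount of I/O.
    buffer_bytes > used_bytes.saturating_mul(2)
}
```

   With this rule, a sliced batch that references 100 bytes of a 1000 byte 
buffer would be compacted before spilling, while a batch that references 900 
of those 1000 bytes would be written as is.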
   
   > On further investigation , found that for each sliced record batch , we 
write the entire original array's buffer as the string view array was not GC'd.
   
   Yeah, this sounds non-ideal to me
   
   > I was testing this issue with this commit that performs GC during spill 
for reference - 
https://github.com/bharath-techie/datafusion/commit/f4412373f72eceed8d0ee0614975144b2185ab88
   
   That change looks reasonable to me
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

