alamb commented on issue #19414: URL: https://github.com/apache/datafusion/issues/19414#issuecomment-3676909238
> Are there scenarios/flows where we GC and spill? I haven't tested external sort, for example.
> Is there a way to figure out that the incoming array is already GC'd?

The operations that do take/filter do the GC'ing internally. Specifically, the coalescer in arrow has a bunch of heuristics about when to GC a string view: https://docs.rs/arrow/latest/arrow/compute/struct.BatchCoalescer.html

I think this is the relevant heuristic: https://github.com/apache/arrow-rs/blob/240cbf4f838387445b0209db4b14dbb277b05a12/arrow-select/src/coalesce/byte_view.rs#L286-L295

> Also, does GC'ing always before spill make sense? Are there scenarios where GC-ing is inefficient?

GCing will copy *all* the referenced strings, so if there isn't a large amount of "dead" data it may be faster just to write it all and read it all back.

When writing to a spill file, I do think it would make sense to GC all the strings if the load factor is small (see heuristic above).

> On further investigation, found that for each sliced record batch, we write the entire original array's buffer as the string view array was not GC'd.

Yeah, this sounds non-ideal to me.

> I was testing this issue with this commit that performs GC during spill for reference - https://github.com/bharath-techie/datafusion/commit/f4412373f72eceed8d0ee0614975144b2185ab88

That change looks reasonable to me.
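
To make the load-factor idea concrete, here is a minimal sketch in standard-library Rust. It is an illustration only: `ViewArray`, `should_gc_before_spill`, and the 2x threshold are hypothetical stand-ins, not the arrow-rs API and not necessarily the heuristic in the linked `byte_view.rs`. A real `StringViewArray` also inlines short strings and can reference several buffers, which this model ignores.

```rust
use std::sync::Arc;

/// Hypothetical, simplified model of a string view array: every view is an
/// (offset, len) range into one shared data buffer. Not the arrow-rs type.
struct ViewArray {
    buffer: Arc<Vec<u8>>,
    views: Vec<(usize, usize)>,
}

impl ViewArray {
    /// Build a compact array: the buffer holds exactly the given strings.
    fn from_strs(values: &[&str]) -> Self {
        let mut buffer = Vec::new();
        let mut views = Vec::with_capacity(values.len());
        for v in values {
            views.push((buffer.len(), v.len()));
            buffer.extend_from_slice(v.as_bytes());
        }
        ViewArray { buffer: Arc::new(buffer), views }
    }

    /// Slicing keeps only some views but still shares the whole buffer,
    /// which is why spilling a slice can write far more bytes than needed.
    fn slice(&self, start: usize, len: usize) -> Self {
        ViewArray {
            buffer: Arc::clone(&self.buffer),
            views: self.views[start..start + len].to_vec(),
        }
    }

    /// Bytes the views actually point at (assumes views do not overlap).
    fn referenced_bytes(&self) -> usize {
        self.views.iter().map(|&(_, len)| len).sum()
    }

    /// Assumed heuristic: GC only when the buffer is more than twice the
    /// size of the referenced data, i.e. the load factor is below 50%.
    fn should_gc_before_spill(&self) -> bool {
        self.buffer.len() > 2 * self.referenced_bytes()
    }

    /// "GC": copy only the referenced bytes into a fresh, compact buffer.
    fn gc(&self) -> Self {
        let mut buffer = Vec::with_capacity(self.referenced_bytes());
        let mut views = Vec::with_capacity(self.views.len());
        for &(offset, len) in &self.views {
            views.push((buffer.len(), len));
            buffer.extend_from_slice(&self.buffer[offset..offset + len]);
        }
        ViewArray { buffer: Arc::new(buffer), views }
    }

    /// Bytes of the i-th value.
    fn value(&self, i: usize) -> &[u8] {
        let (offset, len) = self.views[i];
        &self.buffer[offset..offset + len]
    }
}
```

In this model, an unsliced array built with `from_strs` is already compact, so copying it again would be wasted work. A one-row `slice` of a four-row array still drags the full buffer along, so `should_gc_before_spill` returns `true` and `gc` shrinks the buffer to just the referenced bytes.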
