alamb commented on issue #11628: URL: https://github.com/apache/datafusion/issues/11628#issuecomment-2249093227
> I believe the performance regression is due to late GC. Previously, we called GC immediately after we ran the filter. Now, we call GC only after we accumulate enough values in the buffer,

This makes sense to me and I think your analysis is very clear. Thank you

> should we refactor filter-then-coalesce into one operator? In that way, we don't have intermediate small batches, thus reduce copy. This is a bigger project and can potentially solve the first problem along the way.

I think this is what we should pursue, and I think it is what is covered by https://github.com/apache/datafusion/issues/7957. As you say, it is likely the thing that will perform the best.

Maybe we could explore a solution that builds the output `StringViewArray` as data comes in, rather than waiting for enough data to be accumulated. The code might look like

```rust
while let Some(batch) = input.read_batch() {
    // append new rows to the in-progress output, producing a complete batch if ready
    if let Some(output_batch) = coalescer.push_batch(batch) {
        output.emit(output_batch)
    }
}
```

The idea would be that `coalescer` stores an in-progress `StringViewBuilder` so that the data is copied as batches are pushed

```rust
struct Coalescer {
    in_progress: StringViewBuilder,
    // and similar things for other types 🤔
}

impl Coalescer {
    fn push_batch(&mut self, batch: RecordBatch) -> Option<RecordBatch> {
        // copy relevant values to self.in_progress
        // if in_progress.len is greater than threshold emit a batch
    }
}
```

You might recognize this high level structure from https://github.com/apache/datafusion/pull/11610 :)

> I think this is another example of getting StringView fast in practice requires a lot of careful analysis and implementation!

100% agree
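
To make the push-based shape concrete, here is a minimal std-only sketch. It is not the arrow or DataFusion API: `Batch` is a hypothetical stand-in for `RecordBatch`, a plain `Vec<String>` plays the role of the in-progress `StringViewBuilder`, and `target_rows`, `new`, and `finish` are names assumed for illustration.

```rust
/// Hypothetical stand-in for a batch of string values (not arrow's `RecordBatch`).
type Batch = Vec<String>;

/// Buffers rows from small input batches and emits one larger batch
/// once at least `target_rows` rows have been accumulated.
struct Coalescer {
    /// Plays the role of the in-progress `StringViewBuilder`.
    in_progress: Vec<String>,
    /// Assumed threshold; emitted batches may be larger than this.
    target_rows: usize,
}

impl Coalescer {
    fn new(target_rows: usize) -> Self {
        assert!(target_rows > 0, "target_rows must be positive");
        Self {
            in_progress: Vec::with_capacity(target_rows),
            target_rows,
        }
    }

    /// Appends the rows of `batch` to the in-progress buffer immediately
    /// (a move here; with arrow this would be a copy into the builder),
    /// so the input batch can be dropped right away.
    /// Returns a completed batch if the threshold was reached.
    fn push_batch(&mut self, batch: Batch) -> Option<Batch> {
        self.in_progress.extend(batch);
        if self.in_progress.len() >= self.target_rows {
            let fresh = Vec::with_capacity(self.target_rows);
            Some(std::mem::replace(&mut self.in_progress, fresh))
        } else {
            None
        }
    }

    /// Emits any remaining rows once the input is exhausted.
    fn finish(&mut self) -> Option<Batch> {
        if self.in_progress.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut self.in_progress))
        }
    }
}
```

Because each input batch is consumed as soon as it is pushed, nothing holds on to the small intermediate batches, which is the property the late-GC analysis above is after. A real version would also need a `finish`-style call to flush the tail, and might split output at exactly the target size rather than emitting the whole buffer.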
