alamb commented on issue #11628:
URL: https://github.com/apache/datafusion/issues/11628#issuecomment-2249093227

   > I believe the performance regression is due to late GC. Previously, we 
called GC immediately after we ran the filter. Now, we call GC only after we 
accumulate enough values in the buffer, 
   
   This makes sense to me, and I think your analysis is very clear. Thank you.
   
   > should we refactor filter-then-coalesce into one operator? In that way, we 
don't have intermediate small batches, thus reduce copy. This is a bigger 
project and can potentially solve the first problem along the way.
   
   I think this is what we should pursue, and I believe it is what is covered by 
https://github.com/apache/datafusion/issues/7957. As you say, it is likely the 
approach that will perform the best. 
   
   Maybe we could explore a solution that builds the output 
`StringViewArray` as data comes in, rather than waiting for enough data to be 
accumulated. The code might look like:
   
   ```rust
   while let Some(batch) = input.read_batch() {
     // append new rows to the in-progress output, producing a complete
     // batch if ready
     if let Some(output_batch) = coalescer.push_batch(batch) {
       output.emit(output_batch)
     }
   }
   ```
   
   The idea would be that `coalescer` stores an in-progress `StringViewBuilder` 
so that the data is copied into it as batches are pushed:
   
   ```rust
   struct Coalescer {
     in_progress: StringViewBuilder,
     // and similar things for other types 🤔 
   }
   
   impl Coalescer {
     fn push_batch(&mut self, batch: RecordBatch) -> Option<RecordBatch> {
       // copy relevant values to self.in_progress
       // if in_progress.len is greater than threshold emit a batch
     }
   }
   ```
   
   You might recognize this high level structure from 
https://github.com/apache/datafusion/pull/11610 :)
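
To make the push-based shape concrete, here is a minimal std-only sketch. It is not the DataFusion or arrow-rs API: a plain `Vec<String>` stands in for both `RecordBatch` and `StringViewBuilder`, and the names `Batch`, `target_rows`, `finish`, and `coalesce_all` are hypothetical, introduced only for illustration. It only shows rows being copied into the in-progress output as each batch arrives.

```rust
/// Assumption: a single string column stands in for a `RecordBatch`.
type Batch = Vec<String>;

/// Accumulates rows as they arrive and emits a batch once enough are buffered.
struct Coalescer {
    /// Hypothetical threshold: emit once this many rows are buffered.
    target_rows: usize,
    /// Stand-in for the in-progress `StringViewBuilder`.
    in_progress: Vec<String>,
}

impl Coalescer {
    fn new(target_rows: usize) -> Self {
        assert!(target_rows > 0, "target_rows must be positive");
        Self {
            target_rows,
            in_progress: Vec::with_capacity(target_rows),
        }
    }

    /// Copy the rows of `batch` into the in-progress output right away.
    /// Returns a complete batch once the threshold is reached.
    fn push_batch(&mut self, batch: Batch) -> Option<Batch> {
        self.in_progress.extend(batch);
        if self.in_progress.len() >= self.target_rows {
            // Hand out the buffered rows and start a fresh buffer.
            let fresh = Vec::with_capacity(self.target_rows);
            Some(std::mem::replace(&mut self.in_progress, fresh))
        } else {
            None
        }
    }

    /// Flush whatever is left once the input is exhausted (hypothetical helper).
    fn finish(&mut self) -> Option<Batch> {
        if self.in_progress.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut self.in_progress))
        }
    }
}

/// Drives the loop from the comment above over an in-memory input.
fn coalesce_all(input: Vec<Batch>, target_rows: usize) -> Vec<Batch> {
    let mut coalescer = Coalescer::new(target_rows);
    let mut output = Vec::new();
    for batch in input {
        if let Some(output_batch) = coalescer.push_batch(batch) {
            output.push(output_batch);
        }
    }
    output.extend(coalescer.finish());
    output
}
```

In this sketch an emitted batch can hold more than `target_rows` rows when the last pushed batch overshoots the threshold; a real operator would likely split at the boundary and would also need a builder per column type.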
   
   
   
   > I think this is another example of getting StringView fast in practice 
requires a lot of careful analysis and implementation!
   
   100% agree


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

