Re: [I] Avoid extra copies in `CoalesceBatchesExec` to improve performance [datafusion]

via GitHub Thu, 25 Jul 2024 02:57:38 -0700


2010YOUY01 commented on issue #7957:
URL: https://github.com/apache/datafusion/issues/7957#issuecomment-2249942451


   The expensive `CoalesceBatches` in TPCH-Q1 might be easier to solve 🤔
   Unlike TPCH-Q8 (which appears to have a more selective predicate), Q1's 
filter only throws away a small number of rows:
   ```
   input_batch(4096 rows) -> [FILTER] -> output_batch(~4000 rows)
   ```
   As a result, the following `CoalesceBatches` condition is triggered every 
time, copying the large output batches:
   
https://github.com/apache/datafusion/blob/49d9d45f36989cd448ed6513af65948b6b0100ec/datafusion/physical-plan/src/coalesce_batches.rs#L228
   But even without coalescing, the output batch would still benefit from 
vectorization, so maybe this coalescing threshold could be tuned to something 
like `if self.buffered_rows >= 0.6 * self.target_batch_size {`
   
   I remember trying this before, and the overall improvement on Q1 was only 
around 2%, but I think it's possible to find a better threshold for triggering 
coalescing.


