2010YOUY01 commented on issue #7957: URL: https://github.com/apache/datafusion/issues/7957#issuecomment-2249942451
The expensive `CoalesceBatches` in TPCH-Q1 might be easier to solve 🤔

Unlike TPCH-Q8, whose predicate appears to be more selective, Q1's filter discards only a small number of rows:

```
input_batch(4096 rows) -> [FILTER] -> output_batch(~4000 rows)
```

As a result, the following `CoalesceBatches` condition is triggered every time, and the large output batches get copied:
https://github.com/apache/datafusion/blob/49d9d45f36989cd448ed6513af65948b6b0100ec/datafusion/physical-plan/src/coalesce_batches.rs#L228

Even without coalescing, the output batch would still benefit from vectorization, so this coalescing threshold could perhaps be tuned to something like `if self.buffered_rows >= 0.6 * self.target_batch_size {`.

I remember trying this before, and the overall improvement on Q1 was about 2%. I think it is possible to set a better threshold for triggering coalescing.
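To make the threshold idea concrete, here is a minimal sketch that models only row counts, not real Arrow batches. `CoalesceModel`, `flush_ratio`, and `rows_merged` are hypothetical names and not part of DataFusion. The model also assumes that flushing a single buffered batch needs no copy, and it leaves out any pass-through path for batches that are already large enough.

```rust
/// Hypothetical model of the buffering decision; tracks row counts only.
struct CoalesceModel {
    target_batch_size: usize,
    /// Fraction of `target_batch_size` that triggers a flush (assumed knob).
    flush_ratio: f64,
    buffered_batches: usize,
    buffered_rows: usize,
}

impl CoalesceModel {
    fn new(target_batch_size: usize, flush_ratio: f64) -> Self {
        Self {
            target_batch_size,
            flush_ratio,
            buffered_batches: 0,
            buffered_rows: 0,
        }
    }

    /// Buffers a batch of `rows` rows. Returns `Some((batches, rows))` when
    /// the buffer is flushed as one output batch, `None` otherwise.
    fn push(&mut self, rows: usize) -> Option<(usize, usize)> {
        if rows == 0 {
            return None; // empty batches are dropped
        }
        self.buffered_batches += 1;
        self.buffered_rows += rows;
        let cutoff = self.flush_ratio * self.target_batch_size as f64;
        if self.buffered_rows as f64 >= cutoff {
            let flushed = (self.buffered_batches, self.buffered_rows);
            self.buffered_batches = 0;
            self.buffered_rows = 0;
            Some(flushed)
        } else {
            None
        }
    }
}

/// Sums the rows of every flush that merges more than one input batch.
/// Assumption: a flush holding a single batch can be emitted without a copy.
fn rows_merged(model: &mut CoalesceModel, inputs: &[usize]) -> usize {
    inputs
        .iter()
        .filter_map(|&rows| model.push(rows))
        .filter(|(batches, _)| *batches > 1)
        .map(|(_, rows)| rows)
        .sum()
}
```

Under these assumptions, feeding four 4000-row batches into a model with a 4096-row target merges all 16000 rows when `flush_ratio` is 1.0, and merges none when it is 0.6, because each batch then clears the cutoff on its own. Whether that translates into a real speedup depends on the actual copy cost in the operator, which this model does not measure.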
