Kontinuation commented on PR #1511:
URL: 
https://github.com/apache/datafusion-comet/pull/1511#issuecomment-2736412022

   I have run TPC-H SF=100 benchmarks with various off-heap size configurations. The results show that:
   * interleave_record_batch is slower than the main branch when no spilling happens.
   * interleave_record_batch is faster on Q10 when off-heap memory is less than 8GB, while the main branch could be slower than Spark because of excessive spilling.
   
   The following table shows detailed results.
   
   | On-heap size | Off-heap size | Spark 3.5.4 | Comet main | Comet interleave_record_batch | Bar plot |
   |--|--|--|--|--|--|
   | 3g | 3g | 1054 s | 551 s | 523 s | ![tpch_queries_compare_3g](https://github.com/user-attachments/assets/668d2f0f-a5a6-470e-b3fa-b7c2875460d6) |
   | 3g | 5g | 1050 s | 512 s | 522 s | ![tpch_queries_compare_5g](https://github.com/user-attachments/assets/267017bf-079d-43a7-9600-53237a2c0392) |
   | 3g | 8g | 1032 s | 490 s | 492 s | ![tpch_queries_compare_8g](https://github.com/user-attachments/assets/4c020722-2dfc-4e65-aa3d-6efa4e831f62) |
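
To make the totals easier to compare, here is a small sketch that derives speedups from the numbers in the table. The dictionary layout and the helper names (`relative_change`, `summarize`) are mine and purely illustrative; only the timings come from the benchmark.

```python
# Total TPC-H SF=100 run times in seconds, copied from the table above.
# Keys are the off-heap sizes; on-heap size was 3g in every run.
RESULTS = {
    "3g": {"spark": 1054, "main": 551, "interleave": 523},
    "5g": {"spark": 1050, "main": 512, "interleave": 522},
    "8g": {"spark": 1032, "main": 490, "interleave": 492},
}


def relative_change(baseline: float, candidate: float) -> float:
    """Percent change of candidate relative to baseline (negative = faster)."""
    return (candidate - baseline) / baseline * 100.0


def summarize(results: dict) -> list:
    """Return one summary row per off-heap configuration."""
    rows = []
    for off_heap, t in results.items():
        rows.append({
            "off_heap": off_heap,
            "main_speedup_vs_spark": t["spark"] / t["main"],
            "interleave_speedup_vs_spark": t["spark"] / t["interleave"],
            "interleave_vs_main_pct": relative_change(t["main"], t["interleave"]),
        })
    return rows


if __name__ == "__main__":
    for row in summarize(RESULTS):
        print(
            f"{row['off_heap']}: main {row['main_speedup_vs_spark']:.2f}x, "
            f"interleave {row['interleave_speedup_vs_spark']:.2f}x vs Spark, "
            f"interleave vs main {row['interleave_vs_main_pct']:+.1f}%"
        )
```

By this calculation interleave_record_batch is roughly 5% faster than main at 3g off-heap, roughly 2% slower at 5g, and within about half a percent at 8g.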
   
   Comet main could be slower when running Q10 because it suffers from excessive spilling. Q10 shuffle-writes batches containing string columns, and the current shuffle writer implementation pre-allocates a lot of space for string array builders, so it consumes a lot of memory even when only a few batches have been ingested. We've already seen this in https://github.com/apache/datafusion-comet/issues/887.
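
For readers unfamiliar with the interleave approach, here is a toy model in plain Python. It is not Comet's Rust code: the class `InterleavePartitioner` and its methods are hypothetical, and I am assuming the general idea of buffering input batches as-is and recording `(batch_index, row_index)` pairs per partition, then gathering rows only when a partition is written out. Nothing is pre-allocated per partition, so memory grows with the data actually ingested.

```python
from collections import defaultdict


class InterleavePartitioner:
    """Toy shuffle partitioner that defers copying rows until write time.

    Input batches are kept untouched. For each partition we only remember
    which (batch_index, row_index) pairs belong to it.
    """

    def __init__(self):
        self.batches = []                 # buffered input batches (lists of rows)
        self.indices = defaultdict(list)  # partition id -> [(batch_idx, row_idx)]

    def insert_batch(self, batch, partition_of):
        """Buffer a batch and record the partition of every row.

        `partition_of` is a caller-supplied function mapping a row to a
        partition id (a stand-in for hash partitioning).
        """
        batch_idx = len(self.batches)
        self.batches.append(batch)
        for row_idx, row in enumerate(batch):
            self.indices[partition_of(row)].append((batch_idx, row_idx))

    def interleave(self, partition_id):
        """Gather the rows of one partition from the buffered batches."""
        return [self.batches[b][r] for b, r in self.indices[partition_id]]


if __name__ == "__main__":
    by_key_parity = lambda row: row[0] % 2  # hypothetical partition function
    p = InterleavePartitioner()
    p.insert_batch([(1, "alice"), (2, "bob")], by_key_parity)
    p.insert_batch([(3, "carol"), (4, "dave")], by_key_parity)
    print(p.interleave(0))
    print(p.interleave(1))
```

The trade-off in this model is an extra gather step at write time, which fits the observation that the approach costs a little when nothing spills.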
   
   Here is the comparison of Spark metrics for CometExchange nodes:
   
   | Comet main | Comet interleave_record_batch |
   |--|--|
   | <img width="574" alt="comet-main-exchange" src="https://github.com/user-attachments/assets/5850e81e-8153-4be5-a7e9-aa7bd7cf021d" /> | <img width="571" alt="comet-interleave-exchange" src="https://github.com/user-attachments/assets/3c4e30b7-8ab6-469b-a6ce-111efce98e6f" /> |
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

