akupchinskiy opened a new issue, #2216:
URL: https://github.com/apache/datafusion-comet/issues/2216

   ### What is the problem the feature request solves?
   
   Currently, broadcast batches are compressed twice by the same codec:
   
   1. When Comet builds the RDD, 
[here](https://github.com/apache/datafusion-comet/blob/8112e1acab497ca3a915d4ab3fdce4ce9e64c88a/common/src/main/scala/org/apache/spark/sql/comet/util/Utils.scala#L207)
   2. When Spark initiates the block creation phase, 
[here](https://github.com/apache/spark/blob/7007e1c7ad646bfdc2a89579b2abaa2b3facc6af/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L365)
   
   Removing one of the passes does not yield a statistically significant 
improvement on the standard benchmarks (broadcasts are typically a tiny part of 
the whole plan in terms of data movement), but the double compression might 
hurt workloads where 100 MB to 1 GB chunks of data are explicitly targeted for 
broadcast via a hint or configuration. 
   
   So the point is to check two assumptions:
   
   1. Query performance regresses noticeably on queries where larger amounts 
of data are broadcast.
   2. The double compression brings no benefit in terms of inter-node traffic 
volume. 
   
   If both are true, it may be worth turning off broadcast compression on the 
Comet side, or making it configurable.
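
   As a rough way to probe the second assumption outside a cluster, the sketch below compresses a payload twice with the same codec and reports the size and time of each pass. Everything in it is an assumption made for illustration: it uses Python's `zlib` as a stand-in rather than the codec Spark or Comet actually apply, the `measure_double_compression` helper and `PassStats` type are hypothetical names, and the synthetic `batch` is not a real Comet broadcast batch. It only measures; it does not claim a particular outcome.

```python
import time
import zlib
from typing import NamedTuple


class PassStats(NamedTuple):
    """Sizes (bytes) and wall-clock times (seconds) for two compression passes."""
    raw_bytes: int
    once_bytes: int
    twice_bytes: int
    first_pass_s: float
    second_pass_s: float


def measure_double_compression(payload: bytes, level: int = 6) -> PassStats:
    """Compress `payload`, then compress the result again with the same codec.

    zlib is only a stand-in here; the codec used by Spark/Comet may differ.
    """
    t0 = time.perf_counter()
    once = zlib.compress(payload, level)
    t1 = time.perf_counter()
    twice = zlib.compress(once, level)
    t2 = time.perf_counter()
    return PassStats(len(payload), len(once), len(twice), t1 - t0, t2 - t1)


# Hypothetical stand-in for a broadcast batch: repetitive, compressible rows.
batch = b"".join(
    b"%08d,region-%02d,ACTIVE\n" % (i, i % 16) for i in range(200_000)
)

stats = measure_double_compression(batch)
print(f"raw={stats.raw_bytes} once={stats.once_bytes} twice={stats.twice_bytes}")
print(f"second pass changed size by {stats.twice_bytes - stats.once_bytes:+d} bytes")
print(f"first pass {stats.first_pass_s:.4f}s, second pass {stats.second_pass_s:.4f}s")
```

   A real check would need to run against actual broadcast batches with the configured Spark codec, since both the size delta and the cost of the second pass depend on the data and the codec.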
   
   ### Describe the potential solution
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

