akupchinskiy opened a new issue, #2216: URL: https://github.com/apache/datafusion-comet/issues/2216
### What is the problem the feature request solves?

Currently, broadcast batches are compressed twice by the same codec:

1. During RDD formation on the Comet side [here](https://github.com/apache/datafusion-comet/blob/8112e1acab497ca3a915d4ab3fdce4ce9e64c88a/common/src/main/scala/org/apache/spark/sql/comet/util/Utils.scala#L207)
2. During the block creation phase initiated by Spark [here](https://github.com/apache/spark/blob/7007e1c7ad646bfdc2a89579b2abaa2b3facc6af/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L365)

The double compression makes no statistically significant difference on the standard benchmarks (broadcasts are typically only a small part of the whole plan in terms of data movement), but it might hurt workloads where 100 MB to 1 GB chunks of data are explicitly targeted for broadcast via a hint or configuration.

So the point is to check two assumptions:

1. Query performance regresses notably on queries where larger amounts of data are broadcast.
2. The double compression brings no benefit in terms of inter-node traffic volume.

If both hold, it may be worth turning off broadcast compression on the Comet side, or making it configurable.

### Describe the potential solution

_No response_

### Additional context

_No response_
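
The second assumption (a second pass with the same codec does not shrink the data further) can be checked in isolation before measuring a real cluster. The following is only a minimal sketch under stated assumptions: `zlib` from the Python standard library stands in for the real codec, and `make_payload` and `compress_twice` are hypothetical helpers operating on a synthetic CSV-like payload. None of this reflects the actual Comet or Spark code path, so the numbers it prints are indicative at best.

```python
import random
import time
import zlib


def make_payload(rows: int, seed: int = 42) -> bytes:
    """Build a CSV-like payload of pseudo-random integers.

    Hypothetical stand-in for a broadcast batch; real batches are Arrow data.
    """
    rng = random.Random(seed)
    lines = [f"{rng.randrange(10**9)},{rng.randrange(10**6)}\n" for _ in range(rows)]
    return "".join(lines).encode("ascii")


def compress_twice(payload: bytes, level: int = 1) -> dict:
    """Compress the payload, then compress the result again with the same codec.

    zlib is an assumption here; it only stands in for the configured codec.
    """
    start = time.perf_counter()
    once = zlib.compress(payload, level)   # first pass (the "Comet side" in the issue)
    mid = time.perf_counter()
    twice = zlib.compress(once, level)     # second pass (the "Spark side" in the issue)
    end = time.perf_counter()
    return {
        "raw_bytes": len(payload),
        "once_bytes": len(once),
        "twice_bytes": len(twice),
        "first_pass_s": mid - start,
        "second_pass_s": end - mid,
        "twice": twice,
    }


if __name__ == "__main__":
    stats = compress_twice(make_payload(200_000))
    for key in ("raw_bytes", "once_bytes", "twice_bytes", "first_pass_s", "second_pass_s"):
        print(f"{key}: {stats[key]}")
```

If `twice_bytes` stays close to `once_bytes` while `second_pass_s` is not negligible, that would be consistent with the second pass adding CPU cost without reducing traffic. The first assumption (query-level regression on large broadcasts) still needs an end-to-end benchmark with the real codecs.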