[I] Potential revamp of broadcast compression policy [datafusion-comet]

via GitHub Fri, 22 Aug 2025 07:17:29 -0700


akupchinskiy opened a new issue, #2216:
URL: https://github.com/apache/datafusion-comet/issues/2216

### What is the problem the feature request solves?

Currently, broadcast batches are compressed twice by the same codec:

1. During RDD formation on the Comet side
[here](https://github.com/apache/datafusion-comet/blob/8112e1acab497ca3a915d4ab3fdce4ce9e64c88a/common/src/main/scala/org/apache/spark/sql/comet/util/Utils.scala#L207)
2. During the block creation phase initiated by Spark
[here](https://github.com/apache/spark/blob/7007e1c7ad646bfdc2a89579b2abaa2b3facc6af/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L365)

The double compression has no statistically significant effect on the standard
benchmarks (broadcasts are typically just a tiny part of the whole plan in terms
of data movement), but it might hurt workloads where 100 MB to 1 GB data chunks
are explicitly targeted for broadcast via a hint or configuration.

So the point is to check two assumptions:

1. Query performance regresses noticeably on queries where larger amounts of
data are broadcast.
2. The double compression brings no benefit in terms of inter-node traffic
volume.

If both are true, it may be worth turning off broadcast compression on the
Comet side, or making it configurable.
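
As a rough offline sanity check of the second assumption, the sketch below compresses a synthetic payload twice with the same codec and compares sizes and timings. It is only an illustration under stated assumptions: `zlib` from the Python standard library stands in for whatever codec Spark and Comet are actually configured with, and `make_payload` and `timed_compress` are hypothetical helpers, not part of either project. It does not replace measuring real broadcast exchanges on a cluster.

```python
import random
import time
import zlib
from typing import Tuple


def make_payload(rows: int = 200_000, seed: int = 42) -> bytes:
    """Build a synthetic, moderately compressible CSV-like payload (assumption:
    real broadcast batches will have a different shape and compressibility)."""
    rng = random.Random(seed)
    categories = ["alpha", "beta", "gamma", "delta"]
    lines = [
        f"{i},{rng.choice(categories)},{rng.randint(0, 10**6)}\n"
        for i in range(rows)
    ]
    return "".join(lines).encode("utf-8")


def timed_compress(data: bytes) -> Tuple[bytes, float]:
    """Compress with zlib (a stand-in codec) and return (output, seconds)."""
    start = time.perf_counter()
    out = zlib.compress(data)
    return out, time.perf_counter() - start


payload = make_payload()
once, t_first = timed_compress(payload)   # first pass: raw -> compressed
twice, t_second = timed_compress(once)    # second pass: compressed -> compressed again

print(f"raw bytes:           {len(payload)}")
print(f"after first pass:    {len(once)} ({t_first:.4f}s)")
print(f"after second pass:   {len(twice)} ({t_second:.4f}s)")
print(f"second-pass size ratio: {len(twice) / len(once):.3f}")
```

If the second-pass size ratio stays close to 1.0 while the second pass still takes measurable time, that is consistent with the second compression adding CPU cost without reducing traffic, at least for this stand-in codec and synthetic data.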

### Describe the potential solution

_No response_

### Additional context

_No response_

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
