Yifan Cai created CASSANALYTICS-89:
--------------------------------------

             Summary: Create dedicated data class for broadcast variable during bulk write
                 Key: CASSANALYTICS-89
                 URL: https://issues.apache.org/jira/browse/CASSANALYTICS-89
             Project: Apache Cassandra Analytics
          Issue Type: Improvement
          Components: Writer
            Reporter: Yifan Cai


Bulk write in Analytics uses Spark’s broadcast variable feature to distribute 
job information (BulkWriterContext) to executors. This works today, but it 
triggers unnecessary work in Spark’s SizeEstimator, which inspects all 
fields, including transient ones. BulkWriterContext (and the objects it 
references) contains many transient fields that are never serialized, yet 
SizeEstimator still walks them via reflection, wasting CPU cycles.
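
To illustrate the point about transient fields, here is a minimal sketch of a
reflective field walk that skips only static fields. It is a simplified
stand-in for this kind of traversal, not Spark's actual SizeEstimator code,
and FieldWalkSketch, SampleContext and visitedFields are hypothetical names
invented for this example.

```java
import java.io.Serializable;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Spark or Analytics.
final class FieldWalkSketch {

    // Stand-in for a context object: one serializable setting plus
    // transient state that Java serialization would skip.
    static final class SampleContext implements Serializable {
        private static final long serialVersionUID = 1L;
        String jobId = "job-1";
        transient Object cachedClient = new Object();
        transient Object cachedTopology = new Object();
    }

    // Collects the instance fields a reflective walker would visit.
    // Static fields are skipped; the transient modifier is not consulted.
    static List<String> visitedFields(Class<?> type) {
        List<String> names = new ArrayList<>();
        for (Class<?> c = type; c != null && c != Object.class; c = c.getSuperclass()) {
            for (Field field : c.getDeclaredFields()) {
                if (!Modifier.isStatic(field.getModifiers())) {
                    names.add(field.getName());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // The transient fields show up alongside jobId.
        System.out.println(visitedFields(SampleContext.class));
    }
}
```

Marking a field transient keeps it out of the serialized form, but a walker
written like this still reaches it and everything it references.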

A cleaner approach would be to introduce a dedicated data class for the 
broadcast variable, with only the minimal set of fields required for 
distribution.
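
One possible shape for such a data class is sketched below. This is an
illustration under assumptions, not the project's implementation: HeavyContext
stands in for BulkWriterContext, and BroadcastableJobInfo, toBroadcastable and
the three example fields are hypothetical. The round trip through Java
serialization only shows that the small object carries everything it needs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// All names below are hypothetical; they are not the project's actual classes.
final class BroadcastSketch {

    // Stand-in for a heavyweight context: a few settings plus transient,
    // driver-side state that is never meant to leave the JVM.
    static final class HeavyContext implements Serializable {
        private static final long serialVersionUID = 1L;
        final String jobId;
        final String keyspace;
        final String table;
        final transient Map<String, Object> clusterStateCache = new HashMap<>();
        final transient Object clientHandle = new Object();

        HeavyContext(String jobId, String keyspace, String table) {
            this.jobId = jobId;
            this.keyspace = keyspace;
            this.table = table;
        }

        // Copies only the fields executors need into the small data class.
        BroadcastableJobInfo toBroadcastable() {
            return new BroadcastableJobInfo(jobId, keyspace, table);
        }
    }

    // Dedicated data class: immutable, no transient fields, no references
    // back to the heavyweight context.
    static final class BroadcastableJobInfo implements Serializable {
        private static final long serialVersionUID = 1L;
        final String jobId;
        final String keyspace;
        final String table;

        BroadcastableJobInfo(String jobId, String keyspace, String table) {
            this.jobId = jobId;
            this.keyspace = keyspace;
            this.table = table;
        }
    }

    // Serializes and deserializes the data class in memory.
    static BroadcastableJobInfo roundTrip(BroadcastableJobInfo info)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(info);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (BroadcastableJobInfo) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        HeavyContext context = new HeavyContext("job-1", "ks", "tbl");
        BroadcastableJobInfo copy = roundTrip(context.toBroadcastable());
        System.out.println(copy.jobId + " " + copy.keyspace + "." + copy.table);
    }
}
```

In a Spark job, the small object rather than the full context would be the
value passed to JavaSparkContext.broadcast, and each executor would rebuild
whatever heavier state it needs from those fields.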



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
