[
https://issues.apache.org/jira/browse/CASSANALYTICS-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yifan Cai updated CASSANALYTICS-89:
-----------------------------------
Fix Version/s: 0.2
Source Control Link:
https://github.com/apache/cassandra-analytics/commit/f960685e0ee9fafc0e4819ff40a77179f0f7b859
Resolution: Fixed
Status: Resolved (was: Ready to Commit)
> Create dedicated data class for broadcast variable during bulk write
> --------------------------------------------------------------------
>
> Key: CASSANALYTICS-89
> URL: https://issues.apache.org/jira/browse/CASSANALYTICS-89
> Project: Apache Cassandra Analytics
> Issue Type: Improvement
> Components: Writer
> Reporter: Yifan Cai
> Assignee: Yifan Cai
> Priority: Normal
> Fix For: 0.2
>
> Time Spent: 1h 40m
> Remaining Estimate: 0h
>
> Bulk write in Analytics uses Spark’s broadcast variable feature to distribute
> job information (BulkWriterContext) to executors. While this works today, it
> triggers unnecessary work in Spark’s SizeEstimator: BulkWriterContext (and
> the objects it references) contains many transient fields that are never
> serialized, yet SizeEstimator still walks all of them via reflection,
> wasting CPU cycles.
> A cleaner approach is to introduce a dedicated data class for the broadcast
> variable, containing only the minimal set of fields required for
> distribution.
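
A minimal sketch of the idea, with hypothetical class names (these are illustrative, not the actual Cassandra Analytics types): a field-heavy context declares transient members that Java serialization skips, but a reflective field walk like the one Spark's SizeEstimator performs still visits every declared field, so broadcasting a dedicated, lean payload class reduces that work.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class BroadcastSketch
{
    // Hypothetical stand-in for a field-heavy writer context: transient
    // members are excluded from serialization but are still declared
    // fields, so a reflection-based size estimator visits them anyway.
    static class HeavyContext implements Serializable
    {
        String keyspace;
        String table;
        transient Object sessionCache = new Object();
        transient Map<String, Object> perTaskScratch = new HashMap<>();
        transient Thread monitorThread;
    }

    // Hypothetical dedicated broadcast payload: only the minimal
    // serializable state the executors actually need.
    static class BroadcastPayload implements Serializable
    {
        final String keyspace;
        final String table;

        BroadcastPayload(String keyspace, String table)
        {
            this.keyspace = keyspace;
            this.table = table;
        }
    }

    // Rough approximation of the reflective walk a size estimator does:
    // count declared fields, transient or not.
    static int fieldsVisited(Class<?> cls)
    {
        return cls.getDeclaredFields().length;
    }

    public static void main(String[] args)
    {
        // prints "heavy=5 lean=2"
        System.out.println("heavy=" + fieldsVisited(HeavyContext.class)
                           + " lean=" + fieldsVisited(BroadcastPayload.class));
    }
}
```

Broadcasting an instance of the lean payload instead of the full context keeps the serialized graph, and the fields a size estimator must traverse, to the minimum required for distribution.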
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]