Hi all, could someone please help me understand the broadcast life cycle in detail, especially with regard to memory management?
After reading through the TorrentBroadcast implementation, it seems that for every broadcast object, the driver holds a strong reference to a shallow copy (in MEMORY_AND_DISK) as well as a deep copy of the data in chunked form (in MEMORY_AND_DISK).

Now my questions:

1) Is this observation correct, or does the driver also hold a strong reference to the entire object in serialized form?

2) Are there scenarios, other than with local master or explicit reads in the driver, where the shallow copy is actually used by Spark?

3) Is it a valid workaround to create a wrapper object around the data, broadcast the wrapper, and immediately delete the data after it has been blockified to remove the unnecessary memory requirements?

Regards,
Matthias
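
P.S. To make question 3 more concrete, here is a rough sketch of the wrapper idea. It is plain Java and does not call Spark at all: `DataWrapper`, `blockify`, and `unblockify` are hypothetical names of mine, and `blockify` is only a stand-in (Java serialization plus fixed-size chunking) for whatever TorrentBroadcast does internally. In a real job the wrapper would be handed to `sc.broadcast(...)` instead. Whether clearing the wrapper afterwards is actually safe in Spark is exactly what I am asking.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BroadcastWrapperSketch {

    /** Hypothetical wrapper: the only object the broadcast would reference. */
    static class DataWrapper implements Serializable {
        private static final long serialVersionUID = 1L;
        private double[] data;

        DataWrapper(double[] data) { this.data = data; }

        double[] get() { return data; }

        /** Drops the reference to the payload so it can be garbage collected. */
        void clear() { data = null; }
    }

    /** Stand-in for blockification: serialize, then split into fixed-size chunks. */
    static List<byte[]> blockify(Serializable obj, int blockSize) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        byte[] all = bos.toByteArray();
        List<byte[]> blocks = new ArrayList<>();
        for (int off = 0; off < all.length; off += blockSize) {
            blocks.add(Arrays.copyOfRange(all, off, Math.min(all.length, off + blockSize)));
        }
        return blocks;
    }

    /** Reassembles the chunks and deserializes them, as a reader of the blocks would. */
    static Object unblockify(List<byte[]> blocks) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        for (byte[] block : blocks) {
            bos.write(block);
        }
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        double[] payload = new double[1000];
        Arrays.fill(payload, 7.0);

        DataWrapper wrapper = new DataWrapper(payload);
        List<byte[]> blocks = blockify(wrapper, 4096); // chunked deep copy now exists
        wrapper.clear();                               // wrapper no longer pins the payload

        // The chunks alone are enough to rebuild the full object.
        DataWrapper restored = (DataWrapper) unblockify(blocks);
        System.out.println("blocks: " + blocks.size()
            + ", restored length: " + restored.get().length);
    }
}
```

The point of the indirection is that the long-lived reference would then only pin the small wrapper, while the chunked copy remains the single full copy of the data. Any later read of the wrapper on the driver would of course see null, which is why question 2 matters.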