Take a look at this old discussion about a problem similar to yours: http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-td3203.html

Yong
From: aper...@timerazor.com
Date: Wed, 29 Jun 2016 14:00:07 +0000
Subject: Possible to broadcast a function?
To: user@spark.apache.org

The user guide describes a broadcast as a way to move a large dataset to each node: "Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner."

The broadcast example shows it being used with a variable. But is it somehow possible to instead broadcast a function that can be executed once per node?

My use case is the following: I have a large data structure that I currently create on each executor. The way I create it is a hack. That is, when the RDD function is executed on the executor, I block, load a bunch of data (~250 GiB) from an external data source, create the data structure as a static object in the JVM, and then resume execution. This works, but it costs me a lot of extra memory (i.e. a few TiB when I have a lot of executors).

What I'd like to do is use the broadcast mechanism to load the data structure once per node. But I can't serialize the data structure from the driver. Any ideas?

Thanks!

Aaron
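For reference, the "static object in the JVM" approach Aaron describes is usually written as a lazily initialized singleton, so that every task running in the same executor JVM shares one copy of the structure. Here is a minimal sketch of that pattern outside Spark; the class and method names (BigStructureHolder, BigStructure, load) are illustrative, not part of any Spark API:

```java
// Per-JVM lazy singleton: the expensive load() runs at most once per JVM,
// no matter how many tasks (threads) call get(). Double-checked locking
// with a volatile field keeps the fast path lock-free after initialization.
class BigStructureHolder {
    private static volatile BigStructure instance;

    static BigStructure get() {
        if (instance == null) {
            synchronized (BigStructureHolder.class) {
                if (instance == null) {
                    instance = load();  // expensive: runs once per JVM
                }
            }
        }
        return instance;
    }

    private static BigStructure load() {
        // Placeholder for the ~250 GiB load from the external data source.
        return new BigStructure();
    }
}

class BigStructure {
    static int loadCount = 0;  // instrumentation for this sketch only
    BigStructure() { loadCount++; }
}
```

In a Spark job, such a getter would typically be called from inside the task closure (e.g. within mapPartitions), so the data is fetched on the executor itself rather than serialized from the driver, and subsequent tasks in the same JVM reuse the cached instance.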