Hi there, I was wondering if somebody could give me some suggestions about how to handle this situation:
I have a Spark program that first reads a 6GB file locally (not as an RDD) and then runs map/reduce tasks. This 6GB file contains information that is shared by all the map tasks. Previously, I handled it with Spark's broadcast mechanism, roughly like this (fileRead and MapFunc are my own placeholders):

    val globalData = fileRead("filename")        // read the 6GB file on the driver
    val bcast = sc.broadcast(globalData)         // ship one read-only copy to each executor
    rdd.map(ele => MapFunc(bcast.value, ele))

However, when running the program on a cluster of multiple machines, I found that the remote nodes waited forever for the broadcast of the global file to arrive.

Having each map task load the global file by itself does not seem like a good solution either, since that would incur huge overhead. In fact, every node of our cluster already has a copy of this global file on its local disk. The ideal behavior would be for each node to read the file from its own local disk only once (and keep it in memory), and then have all the map/reduce tasks scheduled to that node share the data. So the global file is neither like a broadcast variable, which is shared by all map/reduce tasks in the job, nor like a private variable seen by only one map task. It is shared node-wide: read once per node and shared by all the tasks mapped to that node.

Could anybody tell me how to program this in Spark? Thanks so much.
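For concreteness, here is a rough sketch of the kind of node-local sharing I have in mind (just my guess at an approach, not something I have working): a singleton object whose lazy val is initialized at most once per executor JVM, so the file is read from local disk once per node and reused by every task scheduled there. The path, NodeLocalData, and MapFunc below are placeholders, and I am assuming the file can be read line by line.

    import scala.io.Source

    // Loaded lazily the first time any task on this executor touches it;
    // subsequent tasks on the same JVM reuse the in-memory copy.
    object NodeLocalData {
      lazy val lines: Array[String] =
        Source.fromFile("/local/path/on/every/node/filename").getLines().toArray
    }

    // Tasks reference the singleton instead of a broadcast variable.
    rdd.map(ele => MapFunc(NodeLocalData.lines, ele))

Would something along these lines work, or is there a more idiomatic way to do this in Spark?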