Example:

  // Load the page-name mapping and collect it to the driver
  val pageNames = sc.textFile("pages.txt").map(...)
  val pageMap = pageNames.collect().toMap
  // Broadcast the map so every executor holds one read-only copy
  val bc = sc.broadcast(pageMap)

  // Map-side join: look up each visit's page name in the broadcast map
  val visits = sc.textFile("visits.txt").map(...)
  val joined = visits.map(v => (v._1, (bc.value(v._1), v._2)))
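In case it helps, here is a self-contained sketch of the same pattern. The parsing in the two map steps is an assumption (the original elides it): I'm guessing tab-separated "pageId<TAB>pageName" lines in pages.txt and "pageId<TAB>visitorIp" lines in visits.txt, and the object name is made up for illustration.

  import org.apache.spark.{SparkConf, SparkContext}

  object BroadcastJoinExample {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("BroadcastJoinExample"))

      // pages.txt: "pageId\tpageName" (assumed format)
      val pageMap = sc.textFile("pages.txt")
        .map { line => val Array(id, name) = line.split("\t"); (id, name) }
        .collect()
        .toMap

      // Ship one read-only copy of the map to each executor
      val bc = sc.broadcast(pageMap)

      // visits.txt: "pageId\tvisitorIp" (assumed format)
      val joined = sc.textFile("visits.txt")
        .map { line => val Array(id, visitor) = line.split("\t"); (id, visitor) }
        .map(v => (v._1, (bc.value(v._1), v._2)))

      joined.take(10).foreach(println)
      sc.stop()
    }
  }

Note that bc.value(v._1) will throw if a page id from visits.txt is missing in pages.txt; bc.value.getOrElse(v._1, "unknown") is the safer lookup.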
In this example you are looking up the page name for each visit and translating it using the pages.txt mapping file.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>

On Fri, May 2, 2014 at 4:16 AM, PengWeiPRC <peng.wei....@gmx.com> wrote:
> Thanks, Rustagi. Yes, the global data is read-only and persists from the
> beginning to the end of the whole Spark job. Actually, it is not only
> identical within one map/reduce task but shared across many of my
> map/reduce tasks. That's why I intend to place the data on each node of
> my cluster, and would like to know whether a Spark program can have all
> the nodes read it simultaneously from their local disks, rather than
> have one node read it and broadcast it to the others. Any suggestions?