Hi, Andrew,

If I broadcast the Map:

  val map2 = sc.broadcast(map1)

I get a compilation error:

  org.apache.spark.broadcast.Broadcast[scala.collection.immutable.Map[Int,String]] does not take parameters
  [error] val matchs= Vecs.map(term=>term.map{case (a,b)=>(map2(a),b)})
It seems the result is still a wrapper rather than a plain Map, so how do I access it by value = map2(key)? Thanks!

Cheers,
Dan

2015-07-22 2:20 GMT-05:00 Andrew Or <and...@databricks.com>:

> Hi Dan,
>
> If the map is small enough, you can just broadcast it, can't you? It
> doesn't have to be an RDD. Here's an example of broadcasting an array and
> using it on the executors:
> https://github.com/apache/spark/blob/c03299a18b4e076cabb4b7833a1e7632c5c0dabe/examples/src/main/scala/org/apache/spark/examples/BroadcastTest.scala
>
> -Andrew
>
> 2015-07-21 19:56 GMT-07:00 ayan guha <guha.a...@gmail.com>:
>
>> Either you have to do rdd.collect and then broadcast, or you can do a join.
>>
>> On 22 Jul 2015 07:54, "Dan Dong" <dongda...@gmail.com> wrote:
>>
>>> Hi, All,
>>>
>>> I am trying to access a Map from RDDs that are on different compute
>>> nodes, but without success. The Map is like:
>>>
>>>   val map1 = Map("aa"->1,"bb"->2,"cc"->3,...)
>>>
>>> All RDDs have to check against it to see whether each key is in the
>>> Map, so it seems I have to make the Map itself global. The problem is
>>> that if the Map is stored as an RDD and spread across the different
>>> nodes, each node will only see a piece of the Map, and the info will
>>> not be complete enough to check against the Map (and then replace the
>>> key with the corresponding value). E.g.:
>>>
>>>   val matchs= Vecs.map(term=>term.map{case (a,b)=>(map1(a),b)})
>>>
>>> But if the Map is not an RDD, how to share it, like sc.broadcast(map1)?
>>>
>>> Any idea about this? Thanks!
>>>
>>> Cheers,
>>> Dan
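[Editor's note: the compilation error above arises because `sc.broadcast` returns a `Broadcast[T]` wrapper, which has no `apply` method; the wrapped value is accessed through `.value`. A minimal sketch of the pattern discussed in this thread, keeping the identifiers from the messages above (`Vecs` and its contents are assumed sample data, not taken from the original posts):]

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastMapExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("BroadcastMap").setMaster("local[*]"))

    // A small lookup Map kept on the driver -- it never needs to be an RDD.
    val map1 = Map("aa" -> 1, "bb" -> 2, "cc" -> 3)

    // Broadcast ships one read-only copy of the Map to each executor.
    val map2 = sc.broadcast(map1)

    // Hypothetical input: each element is a sequence of (key, weight) pairs.
    val Vecs = sc.parallelize(Seq(
      Seq(("aa", 0.5), ("bb", 0.1)),
      Seq(("cc", 0.9))))

    // Access the broadcast contents with map2.value(a).
    // Writing map2(a) is what triggers "Broadcast[...] does not take parameters".
    val matchs = Vecs.map(term => term.map { case (a, b) => (map2.value(a), b) })

    matchs.collect().foreach(println)
    sc.stop()
  }
}
```

[If the data to be shared starts out as an RDD, the collect-then-broadcast route ayan mentions would look like `sc.broadcast(someRdd.collectAsMap())`, provided the result fits in driver memory.]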