Thanks, folks. I was trying to get a multimap out of the RDD, so collectAsMap is what I needed.
Best,
Amit

On Jun 4, 2014, at 6:53, Cheng Lian <lian.cs....@gmail.com> wrote:

> On Wed, Jun 4, 2014 at 5:56 AM, Amit Kumar <kumarami...@gmail.com> wrote:
>
> > Hi Folks,
> >
> > I am new to Spark, and this is probably a basic question.
> >
> > I have a file on HDFS:
> >
> >     1, one
> >     1, uno
> >     2, two
> >     2, dos
> >
> > I want to create a multimap RDD, RDD[Map[String,List[String]]]:
> >
> >     {"1"->["one","uno"], "2"->["two","dos"]}
>
> Actually, what you described is not a “multi-map RDD”. The type of this RDD
> should be something like RDD[(String, List[String])]. RDD[Map[String,
> List[String]]] means that each element of the RDD is itself a
> Map[String, List[String]], and I don’t think that is what you want, judging
> from the context.
>
> > First I read the file:
> >
> >     val identityData:RDD[String] = sc.textFile($path_to_the_file, 2).cache()
>
> You don’t need to call .cache() here. identityData is used only once, so
> the cached data won’t be read again anywhere.
>
> >     val identityDataList:RDD[List[String]] =
> >       identityData.map { line =>
> >         val splits = line.split(",")
> >         splits.toList
> >       }
>
> Turning each text line into a (String, String) pair would be much more
> useful, since you can then call functions like groupByKey, which are defined
> in PairRDDFunctions:
>
>     val identityDataPairs: RDD[(String, String)] = identityData.map { line =>
>       val Array(key, value) = line.split(",")
>       key -> value
>     }
>
> > Then I group them by the first element:
> >
> >     val grouped:RDD[(String,Iterable[List[String]])] =
> >       identityDataList.groupBy { element =>
> >         element(0)
> >       }
>
> Using groupByKey on pair RDDs is more convenient, as mentioned above:
>
>     val grouped: RDD[(String, Iterable[String])] = identityDataPairs.groupByKey()
>
> > Then I do the equivalent of mapValues on Scala collections to get rid of
> > the first element:
> >
> >     val groupedWithValues:RDD[(String,List[String])] =
> >       grouped.flatMap[(String,List[String])] { case (key, list) =>
> >         List((key, list.map { element => element(1) }.toList))
> >       }
> >
> > For this to actually materialize I do a collect:
> >
> >     val groupedAndCollected = groupedWithValues.collect()
> >
> > I get an Array[(String, List[String])].
> >
> > I am trying to figure out if there is a way for me to get a
> > Map[String,List[String]] (a multimap), or to create an
> > RDD[Map[String,List[String]]].
>
> To get a Map[String, Iterable[String]], you can simply call collectAsMap,
> which is only defined on pair RDDs:
>
>     val groupedAndCollected = grouped.collectAsMap()
>
> > I am sure there is something simpler. I would appreciate advice.
> >
> > Many thanks,
> > Amit
>
> Finally, be careful if you are processing a large volume of data:
> groupByKey is an expensive transformation, and collecting all the data to
> the driver may simply cause an OOM if the data can’t fit in the driver node.
>
> Best
> Cheng
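
A small follow-up sketch, not from the thread itself: the sample file has a space after each comma, so a bare line.split(",") leaves the values as " one", " uno", and so on, and the `val Array(key, value) = ...` pattern throws a MatchError on any line that does not split into exactly two fields. The helper below trims both fields and skips malformed lines. The names MultiMapSketch, parseLine, and toMultiMap are hypothetical, and toMultiMap uses plain Scala collections only to illustrate the shape of the result that groupByKey followed by collectAsMap would produce.

```scala
object MultiMapSketch {
  // Hypothetical helper: parse "key, value" into a trimmed pair.
  // Returns None when the line does not contain exactly two non-empty fields.
  def parseLine(line: String): Option[(String, String)] =
    line.split(",").map(_.trim) match {
      case Array(key, value) if key.nonEmpty && value.nonEmpty => Some(key -> value)
      case _ => None
    }

  // Local-collections analogue of groupByKey + collectAsMap:
  // group the parsed pairs by key and keep only the values.
  def toMultiMap(lines: Seq[String]): Map[String, List[String]] =
    lines
      .flatMap(line => parseLine(line))
      .groupBy { case (key, _) => key }
      .map { case (key, pairs) => key -> pairs.map(_._2).toList }
}
```

On the RDD from the thread, the same helper could be plugged in roughly as identityData.flatMap(line => parseLine(line)).groupByKey().collectAsMap(), which brings a map from String to Iterable[String] back to the driver. The caveat about data size still applies.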