So if your data fits in memory on the driver node, do you really need Spark? If you only want it for reading from HDFS, I'd call collect() right after opening the file and then use normal Scala collection operations.
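Something like this (untested) sketch should work, assuming the two-column "key, value" layout from your mail, with pathToFile as a stand-in for the real HDFS path:

// Collect the raw lines to the driver, then build the multimap
// with plain Scala collections. pathToFile is a hypothetical
// stand-in for the HDFS path.
val multiMap: Map[String, List[String]] =
  sc.textFile(pathToFile)
    .collect()                      // pull every line to the driver
    .toList
    .map { line =>
      val Array(key, value) = line.split(",").map(_.trim)
      (key, value)
    }
    .groupBy(_._1)                  // Map[String, List[(String, String)]]
    .mapValues(_.map(_._2))         // keep only the second column
    .toMap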
On Tue, Jun 3, 2014 at 2:56 PM, Amit Kumar <[email protected]> wrote:

> Hi Folks,
>
> I am new to Spark, and this is probably a basic question.
>
> I have a file on HDFS:
>
> 1, one
> 1, uno
> 2, two
> 2, dos
>
> I want to create a multimap RDD, RDD[Map[String,List[String]]]:
>
> {"1"->["one","uno"], "2"->["two","dos"]}
>
> First I read the file:
>
> val identityData: RDD[String] = sc.textFile($path_to_the_file, 2).cache()
>
> val identityDataList: RDD[List[String]] =
>   identityData.map { line =>
>     val splits = line.split(",")
>     splits.toList
>   }
>
> Then I group them by the first element:
>
> val grouped: RDD[(String, Iterable[List[String]])] =
>   identityDataList.groupBy { element =>
>     element(0)
>   }
>
> Then I do the equivalent of mapValues of Scala collections to get rid of
> the first element:
>
> val groupedWithValues: RDD[(String, List[String])] =
>   grouped.flatMap[(String, List[String])] { case (key, list) =>
>     List((key, list.map { element =>
>       element(1)
>     }.toList))
>   }
>
> For this to actually materialize, I do collect:
>
> val groupedAndCollected = groupedWithValues.collect()
>
> I get an Array[(String, List[String])].
>
> I am trying to figure out if there is a way for me to get a
> Map[String,List[String]] (a multimap), or to create an
> RDD[Map[String,List[String]]].
>
> I am sure there is something simpler; I would appreciate advice.
>
> Many thanks,
> Amit
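For completeness, a sketch of a shorter Spark-side version of the quoted steps (untested, assuming the 1.x RDD API and the same pathToFile stand-in): mapping each line straight to a (key, value) pair lets groupByKey replace the groupBy/flatMap combination, and collect().toMap then builds the multimap on the driver.

import org.apache.spark.SparkContext._   // pair-RDD functions (needed before Spark 1.3)
import org.apache.spark.rdd.RDD

// Map each line straight to a (key, value) pair, then group by key.
val grouped: RDD[(String, List[String])] =
  sc.textFile(pathToFile, 2)              // pathToFile: stand-in for the HDFS path
    .map { line =>
      val Array(key, value) = line.split(",").map(_.trim)
      (key, value)
    }
    .groupByKey()                         // RDD[(String, Iterable[String])]
    .mapValues(_.toList)

// collect() gives Array[(String, List[String])]; toMap turns it into the multimap.
val multiMap: Map[String, List[String]] = grouped.collect().toMap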
