Re: RDD with a Map

Amit Wed, 04 Jun 2014 08:10:51 -0700

Thanks, folks. I was trying to get a multimap out of the RDD, so collectAsMap 
is what I needed.
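
For anyone who finds this thread later, here is a minimal sketch of how the
suggestions quoted below fit together. It is only an illustration under a few
assumptions: it runs against a local master, it feeds the four sample lines
through sc.parallelize instead of reading the HDFS file, and the names
MultiMapSketch, toPair and multiMap are made up for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
// Pair-RDD implicits: required on Spark 1.2 and earlier, redundant afterwards.
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object MultiMapSketch {
  // Parse "1, one" into ("1", "one"), trimming the space after the comma.
  // Throws a MatchError on a line without a comma.
  def toPair(line: String): (String, String) = {
    val Array(key, value) = line.split(",", 2).map(_.trim)
    key -> value
  }

  // Group the values per key, then bring the result back to the driver.
  // Only safe when the grouped data fits in driver memory.
  def multiMap(lines: RDD[String]): scala.collection.Map[String, List[String]] =
    lines.map(toPair).groupByKey().mapValues(_.toList).collectAsMap()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("multimap-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.parallelize(Seq("1, one", "1, uno", "2, two", "2, dos"))
      multiMap(lines).foreach { case (key, values) =>
        println(s"$key -> ${values.mkString(", ")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

The result is an ordinary Scala map on the driver with the keys "1" and "2".
The order of the values inside each list should not be relied on, since
groupByKey makes no promise about it.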


Best,
Amit

On Jun 4, 2014, at 6:53, Cheng Lian <lian.cs....@gmail.com> wrote:

> On Wed, Jun 4, 2014 at 5:56 AM, Amit Kumar <kumarami...@gmail.com> wrote:
> 
> Hi Folks,
> 
> I am new to Spark, and this is probably a basic question.
> 
> I have a file on HDFS:
> 
> 1, one
> 1, uno
> 2, two
> 2, dos
> 
> I want to create a multimap RDD, RDD[Map[String,List[String]]]:
> 
> {"1"->["one","uno"], "2"->["two","dos"]}
> 
> Actually, what you described is not a “multi-map RDD”; the type of this RDD 
> should be something like RDD[(String, List[String])]. RDD[Map[String, 
> List[String]]] indicates that each element within this RDD is itself a 
> Map[String, List[String]], and I don’t think this is what you want according 
> to the context.
> 
> 
> 
> First I read the file 
> val identityData:RDD[String] = sc.textFile($path_to_the_file, 2).cache()
> 
> You don’t need to call .cache() here: identityData is used only once, so 
> the cached data would never be reused.
> 
> 
> val identityDataList:RDD[List[String]]=
>       identityData.map{ line =>
>         val splits= line.split(",")
>         splits.toList
>     }
> 
> Turning each text line into a (String, String) pair would be much more 
> useful, since you can then call functions like groupByKey, which are defined 
> in PairRDDFunctions:
> 
> val identityDataPairs: RDD[(String, String)] = identityData.map { line =>
>   // Trim each field: the sample lines have a space after the comma ("1, one")
>   val Array(key, value) = line.split(",").map(_.trim)
>   key -> value
> }
> 
> Then I group them by the first element
> 
>  val grouped:RDD[(String,Iterable[List[String]])]=
>     identityDataList.groupBy{
>       element =>{
>         element(0)
>       }
>     }
> 
> Using groupByKey on pair RDDs is more convenient, as mentioned above:
> 
> val grouped: RDD[(String, Iterable[String])] = identityDataPairs.groupByKey()
> 
> Then I do the equivalent of mapValues of scala collections to get rid of the 
> first element
> 
>  val groupedWithValues:RDD[(String,List[String])] =
>     grouped.flatMap[(String,List[String])]{ case (key,list)=>{
>       List((key,list.map{element => {
>         element(1)
>       }}.toList))
>     }
>     }
> 
> For this to actually materialize, I call collect:
> 
>  val groupedAndCollected=groupedWithValues.collect()
> 
> I get an Array[(String,List[String])].
> 
> I am trying to figure out if there is a way for me to get 
> Map[String,List[String]] (a multimap), or to create an 
> RDD[Map[String,List[String]]].
> 
> To get a Map[String, Iterable[String]], you can simply call collectAsMap, 
> which is only defined on pair RDDs:
> 
> val groupedAndCollected = grouped.collectAsMap()
> 
> 
> I am sure there is something simpler; I would appreciate any advice.
> 
> Many thanks,
> Amit
> 
> Finally, be careful if you are processing a large volume of data: 
> groupByKey is an expensive transformation, and collecting all the data to 
> the driver may simply cause an OOM error if the data can’t fit in the 
> driver’s memory.
> 
> 
> 
> Best
> Cheng
> 
> 