Hi Folks,
I am new to Spark, and this is probably a basic question.
I have a file on HDFS:
1, one
1, uno
2, two
2, dos
I want to create a multimap as an RDD, i.e. RDD[Map[String,List[String]]]:
{"1" -> ["one", "uno"], "2" -> ["two", "dos"]}
First I read the file:
val identityData: RDD[String] = sc.textFile($path_to_the_file, 2).cache()
val identityDataList: RDD[List[String]] =
  identityData.map { line =>
    // trim each field, since the data has a space after the comma
    line.split(",").map(_.trim).toList
  }
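One thing I noticed while debugging: splitting the sample lines on "," alone leaves a leading space on each value, so the fields need a trim. A quick plain-Scala check (no Spark needed):

```scala
// Splitting "1, one" on "," keeps the space after the comma,
// so each field should be trimmed before use.
val raw = "1, one".split(",").toList               // List("1", " one")
val fields = "1, one".split(",").map(_.trim).toList // List("1", "one")
```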
Then I group them by the first element:
val grouped: RDD[(String, Iterable[List[String]])] =
  identityDataList.groupBy { element =>
    element(0)
  }
Then I do the equivalent of mapValues from the Scala collections to get rid
of the first element:
val groupedWithValues: RDD[(String, List[String])] =
  grouped.flatMap { case (key, lists) =>
    List((key, lists.map(element => element(1)).toList))
  }
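To sanity-check the logic, here is the same groupBy plus value-extraction on plain Scala collections (hard-coded sample data standing in for the file, no Spark):

```scala
// Same pipeline on plain Scala collections: split and trim, group by
// the key column, then keep only the second column of each grouped row.
val lines = List("1, one", "1, uno", "2, two", "2, dos")

val rows: List[List[String]] =
  lines.map(_.split(",").map(_.trim).toList)

val multimap: Map[String, List[String]] =
  rows.groupBy(_(0)).map { case (key, grouped) =>
    (key, grouped.map(row => row(1)))
  }
```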
For this to actually materialize I call collect:
val groupedAndCollected = groupedWithValues.collect()
This gives me an Array[(String, List[String])].
I am trying to figure out whether there is a way for me to get a
Map[String,List[String]] (a multimap) out of this, or to create an
RDD[Map[String,List[String]]].
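For the plain Map case, calling .toMap on the collected pairs seems to do it. A sketch, with hard-coded data standing in for the collect() result:

```scala
// Hard-coded stand-in for what collect() returns: an array of
// (key, values) pairs.
val groupedAndCollected: Array[(String, List[String])] =
  Array(("1", List("one", "uno")), ("2", List("two", "dos")))

// Any Scala collection of pairs can be turned into an immutable Map
// with .toMap, which is the multimap shape I'm after.
val asMap: Map[String, List[String]] = groupedAndCollected.toMap
```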
I am sure there is a simpler way; I would appreciate any advice.
Many thanks,
Amit