I'd suggest first reading the scaladoc for RDD and PairRDDFunctions to
familiarize yourself with all the operations available:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunc
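
To give a feel for the kind of operations those two pages document, here is a minimal sketch. It assumes a local-mode SparkContext and a made-up whitespace-separated edge list; the object name PairRddSketch, the helper parseEdge, and the sample data are all hypothetical and only there for illustration. It uses map and distinct from RDD, and groupByKey, mapValues and countByKey from PairRDDFunctions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // brings in the pair-RDD implicits on older Spark releases

object PairRddSketch {
  // Hypothetical helper: split a "source target" line into a (key, value) pair.
  def parseEdge(s: String): (String, String) = {
    val fields = s.split("\\s+")
    (fields(0), fields(1))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for experimenting on a single machine.
    val conf = new SparkConf().setAppName("PairRddSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Made-up input; the last line is a duplicate that distinct() removes.
    val lines = sc.parallelize(Seq("1 2", "1 3", "2 1", "3 1", "4 1", "1 2"))

    // RDD operations: map each line to a pair, then drop duplicate pairs.
    val pairs = lines.map(parseEdge).distinct()

    // PairRDDFunctions: gather all values that share a key.
    val grouped = pairs.groupByKey()

    // PairRDDFunctions: transform only the values, leaving the keys alone.
    val neighbourCounts = grouped.mapValues(_.size)

    // PairRDDFunctions: count the pairs per key and return a Map to the driver.
    val countsOnDriver = pairs.countByKey()

    // collect() brings the results to the driver so they can be printed.
    neighbourCounts.collect().sortBy(_._1).foreach(println)
    countsOnDriver.toSeq.sortBy(_._1).foreach(println)

    sc.stop()
  }
}
```

Note that groupByKey moves every value across the network; when you only need an aggregate per key, operations such as reduceByKey or countByKey are usually the cheaper choice.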
I have the following code:

    val nodes = lines.map { s =>
      val fields = s.split("\\s+")
      (fields(0), fields(1))
    }.distinct().groupByKey().cache()

and when I print out the nodes RDD I get the following:
(4,ArrayBuffer(1))(2,ArrayBuffer(1))(3,ArrayBuffer(1))(1,ArrayBuffer(3, 2,