Call `map(_.toList)` to convert `CompactBuffer` to `List`.

Best Regards,
Shixiong Zhu
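For example, on the grouped result produced by the code quoted below — a minimal sketch with toy stand-in data, where the `grouped` name and the app name are just for illustration, and `mapValues` is used so that `toList` is applied to the values only:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("ToListDemo"))

// Toy stand-in for rdd1InvIndex.join(rdd2Pair).map(str => str._2).groupByKey() below.
val grouped = sc.parallelize(Seq(("1", "1001"), ("1", "1004"), ("2", "1001"))).groupByKey()

// toList materializes each CompactBuffer (an Iterable) as a List, so println
// shows (1,List(1001, 1004)) instead of (1,CompactBuffer(1001, 1004)).
grouped.mapValues(_.toList).collect().foreach(println)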
2015-01-04 12:08 GMT+08:00 Sanjay Subramanian <sanjaysubraman...@yahoo.com.invalid>:

> hi
> Take a look at the code I wrote here:
>
> https://raw.githubusercontent.com/sanjaysubramanian/msfx_scala/master/src/main/scala/org/medicalsidefx/common/utils/PairRddJoin.scala
>
> /* rdd1.txt
>
>    1~4,5,6,7
>    2~4,5
>    3~6,7
>
>    rdd2.txt
>
>    4~1001,1000,1002,1003
>    5~1004,1001,1006,1007
>    6~1007,1009,1005,1008
>    7~1011,1012,1013,1010
> */
>
> import org.apache.spark.{SparkConf, SparkContext}
>
> val sconf = new SparkConf().setMaster("local").setAppName("MedicalSideFx-PairRddJoin")
> val sc = new SparkContext(sconf)
>
> val rdd1 = "/path/to/rdd1.txt"
> val rdd2 = "/path/to/rdd2.txt"
>
> // Invert rdd1: each value becomes a key pointing back at its original key.
> val rdd1InvIndex = sc.textFile(rdd1)
>   .map(x => (x.split('~')(0), x.split('~')(1)))
>   .flatMapValues(str => str.split(','))
>   .map(str => (str._2, str._1))
>
> // rdd2 as (key, comma-separated values) pairs.
> val rdd2Pair = sc.textFile(rdd2).map(str => (str.split('~')(0), str.split('~')(1)))
>
> // Join on the shared key, keep (rdd1 key, rdd2 values), and group.
> rdd1InvIndex.join(rdd2Pair).map(str => str._2).groupByKey().collect().foreach(println)
>
> This outputs the following. I think this may be essentially what you are
> looking for.
>
> (I have to understand how to NOT print as CompactBuffer)
>
> (2,CompactBuffer(1001,1000,1002,1003, 1004,1001,1006,1007))
> (3,CompactBuffer(1011,1012,1013,1010, 1007,1009,1005,1008))
> (1,CompactBuffer(1001,1000,1002,1003, 1011,1012,1013,1010, 1004,1001,1006,1007, 1007,1009,1005,1008))
>
> ------------------------------
> *From:* Sanjay Subramanian <sanjaysubraman...@yahoo.com.INVALID>
> *To:* dcmovva <dilip.mo...@gmail.com>; "user@spark.apache.org" <user@spark.apache.org>
> *Sent:* Saturday, January 3, 2015 12:19 PM
> *Subject:* Re: Joining by values
>
> This is my design. Now let me try and code it in Spark.
>
> rdd1.txt
> =========
> 1~4,5,6,7
> 2~4,5
> 3~6,7
>
> rdd2.txt
> ========
> 4~1001,1000,1002,1003
> 5~1004,1001,1006,1007
> 6~1007,1009,1005,1008
> 7~1011,1012,1013,1010
>
> TRANSFORM 1
> ===========
> Map each value to its key (like an inverted index):
> 4~1
> 5~1
> 6~1
> 7~1
> 5~2
> 4~2
> 6~3
> 7~3
>
> TRANSFORM 2
> ===========
> Join the keys from TRANSFORM 1 with rdd2:
> 4~1,1001,1000,1002,1003
> 4~2,1001,1000,1002,1003
> 5~1,1004,1001,1006,1007
> 5~2,1004,1001,1006,1007
> 6~1,1007,1009,1005,1008
> 6~3,1007,1009,1005,1008
> 7~1,1011,1012,1013,1010
> 7~3,1011,1012,1013,1010
>
> TRANSFORM 3
> ===========
> Split each key in TRANSFORM 2 on "~" and keep key(1), i.e. 1, 2, 3:
> 1~1001,1000,1002,1003
> 2~1001,1000,1002,1003
> 1~1004,1001,1006,1007
> 2~1004,1001,1006,1007
> 1~1007,1009,1005,1008
> 3~1007,1009,1005,1008
> 1~1011,1012,1013,1010
> 3~1011,1012,1013,1010
>
> TRANSFORM 4
> ===========
> Group by key (see the sketch after these steps):
> 1~1001,1000,1002,1003,1004,1001,1006,1007,1007,1009,1005,1008,1011,1012,1013,1010
> 2~1001,1000,1002,1003,1004,1001,1006,1007
> 3~1007,1009,1005,1008,1011,1012,1013,1010
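A minimal Spark sketch of those four steps — the file paths, master, and app name are placeholders, and the `distinct()` and `sorted` calls are additions beyond the design above, included only because the expected output in Dilip's message below is de-duplicated and ordered, which plain concatenation does not give:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("JoinByValues"))

// TRANSFORM 1: invert rdd1 so each value points back at its key: (4,1), (5,1), ...
val inverted = sc.textFile("/path/to/rdd1.txt")
  .map(_.split('~'))
  .flatMap(a => a(1).split(',').map(v => (v, a(0))))

// rdd2 as (key, comma-separated values) pairs: (4, "1001,1000,1002,1003"), ...
val rdd2Pair = sc.textFile("/path/to/rdd2.txt")
  .map(_.split('~'))
  .map(a => (a(0), a(1)))

// TRANSFORMs 2-4: join on the shared key, re-key each rdd2 value by the rdd1
// key, then de-duplicate, group, and sort to match the expected output.
inverted.join(rdd2Pair)
  .flatMap { case (_, (k1, vals)) => vals.split(',').map(v => (k1, v)) }
  .distinct()
  .groupByKey()
  .mapValues(_.toList.sorted)
  .collect()
  .foreach(println)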
> ------------------------------
> *From:* dcmovva <dilip.mo...@gmail.com>
> *To:* user@spark.apache.org
> *Sent:* Saturday, January 3, 2015 10:10 AM
> *Subject:* Joining by values
>
> I have two pair RDDs in Spark like this:
>
> rdd1 = (1 -> [4,5,6,7])
>        (2 -> [4,5])
>        (3 -> [6,7])
>
> rdd2 = (4 -> [1001,1000,1002,1003])
>        (5 -> [1004,1001,1006,1007])
>        (6 -> [1007,1009,1005,1008])
>        (7 -> [1011,1012,1013,1010])
>
> I would like to combine them to look like this:
>
> joinedRdd = (1 -> [1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1013])
>             (2 -> [1000,1001,1002,1003,1004,1006,1007])
>             (3 -> [1005,1007,1008,1009,1010,1011,1012,1013])
>
> Can someone suggest how to do this?
>
> Thanks
> Dilip