Call `map(_.toList)` to convert `CompactBuffer` to `List`.

Best Regards,
Shixiong Zhu
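For example, on the grouped result produced by the code quoted below — a minimal sketch with toy stand-in data, where the `grouped` name and the app name are just for illustration, and `mapValues` is used so that `toList` is applied to the values only:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("ToListDemo"))

// Toy stand-in for rdd1InvIndex.join(rdd2Pair).map(str => str._2).groupByKey() below.
val grouped = sc.parallelize(Seq(("1", "1001"), ("1", "1004"), ("2", "1001"))).groupByKey()

// toList materializes each CompactBuffer (an Iterable) as a List, so println
// shows (1,List(1001, 1004)) instead of (1,CompactBuffer(1001, 1004)).
grouped.mapValues(_.toList).collect().foreach(println)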
2015-01-04 12:08 GMT+08:00 Sanjay Subramanian <sanjaysubraman...@yahoo.com.invalid>:

> hi
> Take a look at the code I wrote here:
>
> https://raw.githubusercontent.com/sanjaysubramanian/msfx_scala/master/src/main/scala/org/medicalsidefx/common/utils/PairRddJoin.scala
>
> /* rdd1.txt
>
>    1~4,5,6,7
>    2~4,5
>    3~6,7
>
>    rdd2.txt
>
>    4~1001,1000,1002,1003
>    5~1004,1001,1006,1007
>    6~1007,1009,1005,1008
>    7~1011,1012,1013,1010
> */
>
> import org.apache.spark.{SparkConf, SparkContext}
>
> val sconf = new SparkConf().setMaster("local").setAppName("MedicalSideFx-PairRddJoin")
> val sc = new SparkContext(sconf)
>
> val rdd1 = "/path/to/rdd1.txt"
> val rdd2 = "/path/to/rdd2.txt"
>
> // Invert rdd1: each value becomes a key pointing back at its original key.
> val rdd1InvIndex = sc.textFile(rdd1)
>   .map(x => (x.split('~')(0), x.split('~')(1)))
>   .flatMapValues(str => str.split(','))
>   .map(str => (str._2, str._1))
>
> // rdd2 as (key, comma-separated values) pairs.
> val rdd2Pair = sc.textFile(rdd2).map(str => (str.split('~')(0), str.split('~')(1)))
>
> // Join on the shared key, keep (rdd1 key, rdd2 values), and group.
> rdd1InvIndex.join(rdd2Pair).map(str => str._2).groupByKey().collect().foreach(println)
>
> This outputs the following. I think this may be essentially what you are
> looking for.
>
> (I have to understand how to NOT print as CompactBuffer)
>
> (2,CompactBuffer(1001,1000,1002,1003, 1004,1001,1006,1007))
> (3,CompactBuffer(1011,1012,1013,1010, 1007,1009,1005,1008))
> (1,CompactBuffer(1001,1000,1002,1003, 1011,1012,1013,1010, 1004,1001,1006,1007, 1007,1009,1005,1008))
>
> ------------------------------
> *From:* Sanjay Subramanian <sanjaysubraman...@yahoo.com.INVALID>
> *To:* dcmovva <dilip.mo...@gmail.com>; "user@spark.apache.org" <user@spark.apache.org>
> *Sent:* Saturday, January 3, 2015 12:19 PM
> *Subject:* Re: Joining by values
>
> This is my design. Now let me try and code it in Spark.
>
> rdd1.txt
> =========
> 1~4,5,6,7
> 2~4,5
> 3~6,7
>
> rdd2.txt
> ========
> 4~1001,1000,1002,1003
> 5~1004,1001,1006,1007
> 6~1007,1009,1005,1008
> 7~1011,1012,1013,1010
>
> TRANSFORM 1
> ===========
> Map each value to its key (like an inverted index):
> 4~1
> 5~1
> 6~1
> 7~1
> 5~2
> 4~2
> 6~3
> 7~3
>
> TRANSFORM 2
> ===========
> Join the keys from TRANSFORM 1 with rdd2:
> 4~1,1001,1000,1002,1003
> 4~2,1001,1000,1002,1003
> 5~1,1004,1001,1006,1007
> 5~2,1004,1001,1006,1007
> 6~1,1007,1009,1005,1008
> 6~3,1007,1009,1005,1008
> 7~1,1011,1012,1013,1010
> 7~3,1011,1012,1013,1010
>
> TRANSFORM 3
> ===========
> Split each key in TRANSFORM 2 on "~" and keep key(1), i.e. 1, 2, 3:
> 1~1001,1000,1002,1003
> 2~1001,1000,1002,1003
> 1~1004,1001,1006,1007
> 2~1004,1001,1006,1007
> 1~1007,1009,1005,1008
> 3~1007,1009,1005,1008
> 1~1011,1012,1013,1010
> 3~1011,1012,1013,1010
>
> TRANSFORM 4
> ===========
> Group by key (see the sketch after these steps):
> 1~1001,1000,1002,1003,1004,1001,1006,1007,1007,1009,1005,1008,1011,1012,1013,1010
> 2~1001,1000,1002,1003,1004,1001,1006,1007
> 3~1007,1009,1005,1008,1011,1012,1013,1010
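A minimal Spark sketch of those four steps — the file paths, master, and app name are placeholders, and the `distinct()` and `sorted` calls are additions beyond the design above, included only because the expected output in Dilip's message below is de-duplicated and ordered, which plain concatenation does not give:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("JoinByValues"))

// TRANSFORM 1: invert rdd1 so each value points back at its key: (4,1), (5,1), ...
val inverted = sc.textFile("/path/to/rdd1.txt")
  .map(_.split('~'))
  .flatMap(a => a(1).split(',').map(v => (v, a(0))))

// rdd2 as (key, comma-separated values) pairs: (4, "1001,1000,1002,1003"), ...
val rdd2Pair = sc.textFile("/path/to/rdd2.txt")
  .map(_.split('~'))
  .map(a => (a(0), a(1)))

// TRANSFORMs 2-4: join on the shared key, re-key each rdd2 value by the rdd1
// key, then de-duplicate, group, and sort to match the expected output.
inverted.join(rdd2Pair)
  .flatMap { case (_, (k1, vals)) => vals.split(',').map(v => (k1, v)) }
  .distinct()
  .groupByKey()
  .mapValues(_.toList.sorted)
  .collect()
  .foreach(println)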
> ------------------------------
> *From:* dcmovva <dilip.mo...@gmail.com>
> *To:* user@spark.apache.org
> *Sent:* Saturday, January 3, 2015 10:10 AM
> *Subject:* Joining by values
>
> I have two pair RDDs in Spark like this:
>
> rdd1 = (1 -> [4,5,6,7])
>        (2 -> [4,5])
>        (3 -> [6,7])
>
> rdd2 = (4 -> [1001,1000,1002,1003])
>        (5 -> [1004,1001,1006,1007])
>        (6 -> [1007,1009,1005,1008])
>        (7 -> [1011,1012,1013,1010])
>
> I would like to combine them to look like this:
>
> joinedRdd = (1 -> [1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1013])
>             (2 -> [1000,1001,1002,1003,1004,1006,1007])
>             (3 -> [1005,1007,1008,1009,1010,1011,1012,1013])
>
> Can someone suggest how to do this?
>
> Thanks
> Dilip