So I changed the code to

rdd1InvIndex.join(rdd2Pair).map(str => str._2).groupByKey().map(str => (str._1, str._2.toList)).collect().foreach(println)

Now it prints. Don't worry, I will work on this so it does not output as List(...). But I am hoping this answers the JOIN question that @Dilip asked :-)

(2,List(1001,1000,1002,1003, 1004,1001,1006,1007))
(3,List(1011,1012,1013,1010, 1007,1009,1005,1008))
(1,List(1001,1000,1002,1003, 1011,1012,1013,1010, 1004,1001,1006,1007, 1007,1009,1005,1008))
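One possible way to drop the List(...) wrapper entirely, sketched against the rdd1InvIndex and rdd2Pair RDDs built in the code quoted further down (a suggestion, not something settled in the thread), is to flatten each group into a single string with mkString before printing:

// Sketch: format each joined group as key~id1,id2,... instead of (key,List(...)).
// Assumes rdd1InvIndex and rdd2Pair exist exactly as in Sanjay's PairRddJoin code below.
rdd1InvIndex.join(rdd2Pair)
  .map(_._2)                        // keep the (rdd1Key, "id1,id2,...") pairs
  .groupByKey()
  .mapValues(_.mkString(","))       // collapse each group into one comma-separated string
  .collect()
  .foreach { case (k, ids) => println(s"$k~$ids") }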
From: Shixiong Zhu <zsxw...@gmail.com>
To: Sanjay Subramanian <sanjaysubraman...@yahoo.com>
Cc: dcmovva <dilip.mo...@gmail.com>; "user@spark.apache.org" <user@spark.apache.org>
Sent: Saturday, January 3, 2015 8:15 PM
Subject: Re: Joining by values

Call `map(_.toList)` to convert `CompactBuffer` to `List`.

Best Regards,
Shixiong Zhu

2015-01-04 12:08 GMT+08:00 Sanjay Subramanian <sanjaysubraman...@yahoo.com.invalid>:

hi

Take a look at the code I wrote here:
https://raw.githubusercontent.com/sanjaysubramanian/msfx_scala/master/src/main/scala/org/medicalsidefx/common/utils/PairRddJoin.scala

import org.apache.spark.{SparkConf, SparkContext}

/*
rdd1.txt
1~4,5,6,7
2~4,5
3~6,7

rdd2.txt
4~1001,1000,1002,1003
5~1004,1001,1006,1007
6~1007,1009,1005,1008
7~1011,1012,1013,1010
*/

val sconf = new SparkConf().setMaster("local").setAppName("MedicalSideFx-PairRddJoin")
val sc = new SparkContext(sconf)
val rdd1 = "/path/to/rdd1.txt"
val rdd2 = "/path/to/rdd2.txt"

// Invert rdd1: each value becomes a key pointing back to its original key, e.g. (4,1), (5,1), ...
val rdd1InvIndex = sc.textFile(rdd1)
  .map(x => (x.split('~')(0), x.split('~')(1)))
  .flatMapValues(str => str.split(','))
  .map(str => (str._2, str._1))

// Parse rdd2 into (key, "id1,id2,...") pairs, e.g. (4,"1001,1000,1002,1003")
val rdd2Pair = sc.textFile(rdd2)
  .map(str => (str.split('~')(0), str.split('~')(1)))

// Join on the shared key, then regroup by the original rdd1 key
rdd1InvIndex.join(rdd2Pair).map(str => str._2).groupByKey().collect().foreach(println)

This outputs the following. I think this may be essentially what u r looking for (I have to understand how to NOT print as CompactBuffer):

(2,CompactBuffer(1001,1000,1002,1003, 1004,1001,1006,1007))
(3,CompactBuffer(1011,1012,1013,1010, 1007,1009,1005,1008))
(1,CompactBuffer(1001,1000,1002,1003, 1011,1012,1013,1010, 1004,1001,1006,1007, 1007,1009,1005,1008))

From: Sanjay Subramanian <sanjaysubraman...@yahoo.com.INVALID>
To: dcmovva <dilip.mo...@gmail.com>; "user@spark.apache.org" <user@spark.apache.org>
Sent: Saturday, January 3, 2015 12:19 PM
Subject: Re: Joining by values

This is my design. Now let me try and code it in Spark.

rdd1.txt
========
1~4,5,6,7
2~4,5
3~6,7

rdd2.txt
========
4~1001,1000,1002,1003
5~1004,1001,1006,1007
6~1007,1009,1005,1008
7~1011,1012,1013,1010

TRANSFORM 1
===========
Map each value to its key (like an inverted index):
4~1
5~1
6~1
7~1
5~2
4~2
6~3
7~3

TRANSFORM 2
===========
Join the keys from transform 1 with rdd2:
4~1,1001,1000,1002,1003
4~2,1001,1000,1002,1003
5~1,1004,1001,1006,1007
5~2,1004,1001,1006,1007
6~1,1007,1009,1005,1008
6~3,1007,1009,1005,1008
7~1,1011,1012,1013,1010
7~3,1011,1012,1013,1010

TRANSFORM 3
===========
Split the key in transform 2 on "~" and keep key(1), i.e. 1, 2, 3:
1~1001,1000,1002,1003
2~1001,1000,1002,1003
1~1004,1001,1006,1007
2~1004,1001,1006,1007
1~1007,1009,1005,1008
3~1007,1009,1005,1008
1~1011,1012,1013,1010
3~1011,1012,1013,1010

TRANSFORM 4
===========
Concatenate the values by key:
1~1001,1000,1002,1003,1004,1001,1006,1007,1007,1009,1005,1008,1011,1012,1013,1010
2~1001,1000,1002,1003,1004,1001,1006,1007
3~1007,1009,1005,1008,1011,1012,1013,1010

From: dcmovva <dilip.mo...@gmail.com>
To: user@spark.apache.org
Sent: Saturday, January 3, 2015 10:10 AM
Subject: Joining by values

I have two pair RDDs in Spark like this:

rdd1 = (1 -> [4,5,6,7])
       (2 -> [4,5])
       (3 -> [6,7])

rdd2 = (4 -> [1001,1000,1002,1003])
       (5 -> [1004,1001,1006,1007])
       (6 -> [1007,1009,1005,1008])
       (7 -> [1011,1012,1013,1010])

I would like to combine them to look like this:

joinedRdd = (1 -> [1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1013])
            (2 -> [1000,1001,1002,1003,1004,1006,1007])
            (3 -> [1005,1007,1008,1009,1010,1011,1012,1013])

Can someone suggest how to do this?

Thanks
Dilip

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Joining-by-values-tp20954.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
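Pulling the thread together, here is a minimal, self-contained sketch (a suggestion, not code from the thread) of the same inverted-index-and-join approach, written directly against Dilip's in-memory pair RDDs from the original question. The distinct and sorted steps, and the JoinByValues object name, are assumptions taken from the desired joinedRdd above rather than anything the thread settled on:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on older Spark versions

object JoinByValues {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("JoinByValues"))

    // Dilip's two pair RDDs, built in memory instead of from rdd1.txt / rdd2.txt
    val rdd1 = sc.parallelize(Seq(
      1 -> Seq(4, 5, 6, 7),
      2 -> Seq(4, 5),
      3 -> Seq(6, 7)))

    val rdd2 = sc.parallelize(Seq(
      4 -> Seq(1001, 1000, 1002, 1003),
      5 -> Seq(1004, 1001, 1006, 1007),
      6 -> Seq(1007, 1009, 1005, 1008),
      7 -> Seq(1011, 1012, 1013, 1010)))

    val joined = rdd1
      .flatMap { case (k, vs) => vs.map(v => (v, k)) }          // invert: (4,1), (5,1), ...
      .join(rdd2)                                               // (4, (1, Seq(1001, ...))), ...
      .flatMap { case (_, (k, ids)) => ids.map(id => (k, id)) } // (1,1001), (1,1000), ...
      .distinct()                                               // assumption: drop duplicate ids
      .groupByKey()
      .mapValues(_.toSeq.sorted)                                // assumption: sort ids per key

    joined.collect().foreach { case (k, ids) =>
      println(s"$k -> ${ids.mkString("[", ",", "]")}")
    }

    sc.stop()
  }
}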