Re: partitioner aware subtract

2016-05-10 Thread Raghava Mutharaju
Thank you for the response. This does not work on the test case that I mentioned in the previous email. val data1 = Seq((1 -> 2), (1 -> 5), (2 -> 3), (3 -> 20), (3 -> 16)) val data2 = Seq((1 -> 2), (3 -> 30), (3 -> 16), (5 -> 12)) val rdd1 = sc.parallelize(data1, 8) val rdd2 = sc.parallelize(data

Re: partitioner aware subtract

2016-05-10 Thread Rishi Mishra
As you have same partitioner and number of partitions probably you can use zipPartition and provide a user defined function to substract . A very primitive example being. val data1 = Seq(1->1,2->2,3->3,4->4,5->5,6->6,7->7) val data2 = Seq(1->1,2->2,3->3,4->4,5->5,6->6) val rdd1 = sc.parallelize(

Re: partitioner aware subtract

2016-05-09 Thread Raghava Mutharaju
We tried that but couldn't figure out a way to efficiently filter it. Lets take two RDDs. rdd1: (1,2) (1,5) (2,3) (3,20) (3,16) rdd2: (1,2) (3,30) (3,16) (5,12) rdd1.leftOuterJoin(rdd2) and get rdd1.subtract(rdd2): (1,(2,Some(2))) (1,(5,Some(2))) (2,(3,None)) (3,(20,Some(30))) (3,(20,Some(16)

Re: partitioner aware subtract

2016-05-09 Thread ayan guha
How about outer join? On 9 May 2016 13:18, "Raghava Mutharaju" wrote: > Hello All, > > We have two PairRDDs (rdd1, rdd2) which are hash partitioned on key > (number of partitions are same for both the RDDs). We would like to > subtract rdd2 from rdd1. > > The subtract code at > https://github.com