Thank you for the response.
This does not work on the test case that I mentioned in the previous email.
val data1 = Seq((1 -> 2), (1 -> 5), (2 -> 3), (3 -> 20), (3 -> 16))
val data2 = Seq((1 -> 2), (3 -> 30), (3 -> 16), (5 -> 12))
val rdd1 = sc.parallelize(data1, 8)
val rdd2 = sc.parallelize(data
As you have same partitioner and number of partitions probably you can use
zipPartition and provide a user defined function to substract .
A very primitive example being.
val data1 = Seq(1->1,2->2,3->3,4->4,5->5,6->6,7->7)
val data2 = Seq(1->1,2->2,3->3,4->4,5->5,6->6)
val rdd1 = sc.parallelize(
We tried that but couldn't figure out a way to efficiently filter it. Lets
take two RDDs.
rdd1:
(1,2)
(1,5)
(2,3)
(3,20)
(3,16)
rdd2:
(1,2)
(3,30)
(3,16)
(5,12)
rdd1.leftOuterJoin(rdd2) and get rdd1.subtract(rdd2):
(1,(2,Some(2)))
(1,(5,Some(2)))
(2,(3,None))
(3,(20,Some(30)))
(3,(20,Some(16)
How about outer join?
On 9 May 2016 13:18, "Raghava Mutharaju" wrote:
> Hello All,
>
> We have two PairRDDs (rdd1, rdd2) which are hash partitioned on key
> (number of partitions are same for both the RDDs). We would like to
> subtract rdd2 from rdd1.
>
> The subtract code at
> https://github.com