Hello All,

We have two PairRDDs (rdd1 and rdd2) that are hash partitioned on key, with the same number of partitions for both RDDs. We would like to subtract rdd2 from rdd1.
The subtract() code at https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala wraps each element of both RDDs as (x, null), where x is the element itself, repartitions them on x, and then calls subtractByKey(). This means the RDDs are repartitioned on x, which in our case is the key and value combined. Since both RDDs are already hash partitioned on the key of x, can we take advantage of that with a PairRDD/HashPartitioner-aware subtract? Is there a way to use mapPartitions() for this?

We tried broadcasting rdd2 and doing a local set difference against the broadcast value inside mapPartitions() on rdd1, but this turned out to be memory consuming and inefficient. Calling destroy() on the broadcast variable afterwards did not help.

The current subtract() is slow for us: rdd1 and rdd2 are around 700 MB each, and the subtract takes around 14 seconds. Any ideas on this issue are highly appreciated.

Regards,
Raghava.
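P.S. To make the question concrete, here is a sketch of the kind of partitioner-aware subtract we have in mind (this is our own hypothetical code, not anything in Spark). It assumes both RDDs are co-partitioned, i.e. they share the same HashPartitioner and partition count, which zipPartitions() requires; under that assumption, equal pairs are guaranteed to sit in matching partitions, so a shuffle-free, partition-local set difference should be correct:

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical sketch: assumes rdd1 and rdd2 are hash partitioned by key
// with the same partitioner and number of partitions, so any pair present
// in both RDDs lands in the same partition index of each.
def subtractCoPartitioned[K: ClassTag, V: ClassTag](
    rdd1: RDD[(K, V)], rdd2: RDD[(K, V)]): RDD[(K, V)] =
  rdd1.zipPartitions(rdd2, preservesPartitioning = true) { (left, right) =>
    // Materialize only rdd2's partition locally, not all of rdd2
    // (unlike the broadcast approach we tried).
    val toRemove = right.toSet
    left.filterNot(toRemove.contains)
  }
```

Unlike broadcasting rdd2, each task here only holds one partition of rdd2 in memory. Does this look sound, or are there cases (e.g. the partitioners silently differing) where this would give wrong results?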