Hello All,

We have two PairRDDs (rdd1, rdd2) that are hash-partitioned on the key (the
number of partitions is the same for both RDDs). We would like to subtract
rdd2 from rdd1.

The subtract code at
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
seems to map each element x of both RDDs to (x, null), partition on x, and
then use subtractByKey(). This means both RDDs have to be repartitioned on
x, which in our case is the key and value combined. However, both RDDs are
already hash-partitioned on the key of x. Can we take advantage of this
with a PairRDD/HashPartitioner-aware subtract? Is there a way to use
mapPartitions() for this, e.g. along the lines of the sketch below?
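Concretely, we were thinking of something like the following (a rough,
untested sketch, with (Long, String) pairs and a partition count of 8
standing in for our actual types and setup):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Rough, untested sketch of a partitioner-aware subtract. It assumes both
// RDDs were built with the same HashPartitioner, so any (key, value) pair
// present in both RDDs lands in the same partition index of both.
val sc = new SparkContext(new SparkConf().setAppName("subtract-sketch"))
val part = new HashPartitioner(8)  // placeholder; our real partition count differs

val rdd1 = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c"))).partitionBy(part)
val rdd2 = sc.parallelize(Seq((2L, "b"), (3L, "x"))).partitionBy(part)

// Per-partition set difference: no shuffle, existing partitioning preserved.
val diff = rdd1.zipPartitions(rdd2, preservesPartitioning = true) { (left, right) =>
  val toRemove = right.toSet
  left.filterNot(toRemove.contains)
}

diff.collect().foreach(println)  // expected: (1,a) and (3,c)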

We also tried broadcasting rdd2 and using mapPartitions: inside
mapPartitions of rdd1, we computed a local set difference against the
broadcast copy of rdd2. This turned out to be memory-intensive and
inefficient, and calling destroy() on the broadcast variable did not help.
A simplified sketch of what we tried is below.
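Roughly (simplified; our real code does the same with much larger data):

// Simplified sketch of the broadcast approach we tried: collect rdd2 on the
// driver, broadcast it, and filter each partition of rdd1 against it.
val rdd2Set = rdd2.collect().toSet
val bcastRdd2 = sc.broadcast(rdd2Set)

val diff2 = rdd1.mapPartitions({ iter =>
  val toRemove = bcastRdd2.value
  iter.filterNot(toRemove.contains)
}, preservesPartitioning = true)

diff2.count()        // force evaluation before releasing the broadcast
bcastRdd2.destroy()  // destroying the broadcast did not reduce memory pressure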

The current subtract method is slow for us: rdd1 and rdd2 are around 700 MB
each, and the subtract takes around 14 seconds.

Any ideas on this issue would be highly appreciated.

Regards,
Raghava.
