Hi all,
I'm currently implementing LSH on Spark, and it reduces to the following
problem: I have an RDD[(Int, Int)] holding the pairs of vector ids whose
distance must be computed, and another RDD[(Int, Vector)] holding the
vectors keyed by id. Can anyone suggest an efficient way to compute these
distances? The naive version I tried first is below, but it is inefficient
because the cartesian product shuffles a large amount of data over the
network.

rdd1: RDD[(Int, Int)] = ...
rdd2: RDD[(Int, Vector)] = ...

// Materializes all |rdd2|^2 vector pairs before filtering by rdd1.
val distances = rdd2.cartesian(rdd2)
      .map(x => ((x._1._1, x._2._1), (x._1._2, x._2._2)))  // ((id1, id2), (v1, v2))
      .join(rdd1.map(x => (x, 1)))                         // keep only the requested pairs
      .mapValues(x => measure.compute(x._1._1, x._1._2))   // distance per pair
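One way to avoid the full cartesian product is to join rdd1 against rdd2
twice, once per side of the pair, so only the vectors that are actually
referenced in rdd1 get shuffled. A rough sketch, assuming measure.compute
takes two Vectors and returns a Double (the measure object is from the
snippet above, not a standard API):

```scala
// Join on the first id, re-key by the second id, join again.
// Shuffle cost is proportional to |rdd1| + |rdd2|, not |rdd2|^2.
val distances2: RDD[((Int, Int), Double)] =
  rdd1
    .join(rdd2)                                       // (id1, (id2, v1))
    .map { case (id1, (id2, v1)) => (id2, (id1, v1)) }
    .join(rdd2)                                       // (id2, ((id1, v1), v2))
    .map { case (id2, ((id1, v1), v2)) =>
      ((id1, id2), measure.compute(v1, v2))           // hypothetical distance measure
    }
```

If rdd2 is small enough to fit in memory, broadcasting it as a map and
doing a single mapPartitions over rdd1 would avoid the shuffles entirely.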

Thanks for any suggestion.
