Hi all,

I'm currently working on implementing LSH on Spark, and it leads to the following problem: I have an RDD[(Int, Int)] that stores all pairs of vector ids for which distances need to be computed, and another RDD[(Int, Vector)] that stores all vectors with their ids. Can anyone suggest an efficient way to compute the distances? The simple version I tried first is below, but it's inefficient because it shuffles a lot of data over the network.
val rdd1: RDD[(Int, Int)] = ...    // id pairs that need a distance
val rdd2: RDD[(Int, Vector)] = ... // vectors keyed by id

val distances = rdd2.cartesian(rdd2)
  .map { case ((id1, v1), (id2, v2)) => ((id1, id2), (v1, v2)) }
  .join(rdd1.map(x => (x, 1)))
  .mapValues { case ((v1, v2), _) => measure.compute(v1, v2) }

Thanks for any suggestion.
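For reference, here is a small plain-Scala sketch (no Spark) of what the pipeline above is meant to compute. The names and the Euclidean `measure` are just illustrative stand-ins; the point is that only the pairs listed in rdd1 ever need a distance, so the full cartesian product of rdd2 with itself is wasted work:

```scala
object DistanceSketch {
  // Hypothetical stand-in for measure.compute: Euclidean distance
  def euclidean(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  def main(args: Array[String]): Unit = {
    // rdd2 analogue: vectors keyed by id
    val vectors: Map[Int, Array[Double]] = Map(
      1 -> Array(0.0, 0.0),
      2 -> Array(3.0, 4.0),
      3 -> Array(1.0, 1.0)
    )
    // rdd1 analogue: only these id pairs need a distance
    val pairs: Seq[(Int, Int)] = Seq((1, 2), (1, 3))

    // Compute the measure only for the listed pairs -- the analogue of
    // joining the pair list against the vectors instead of a cartesian
    val distances: Seq[((Int, Int), Double)] =
      pairs.map { case (i, j) => ((i, j), euclidean(vectors(i), vectors(j))) }

    distances.foreach { case ((i, j), d) => println(s"($i,$j) -> $d") }
  }
}
```

In Spark terms the same idea would be to join rdd1 against rdd2 on each id in turn (keying first by the left id, then by the right id) so that only the needed vectors move over the network, rather than materialising all n^2 pairs with cartesian.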