This is my first thought, please suggest any further improvements:

1. Create an RDD of your dataset.
2. Do a cross join (cartesian) to generate the pairs.
3. Apply reduceByKey and compute the distance.

You will get an RDD keyed by the pair, with the distance as the value. See the sketch below.
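A minimal sketch of these steps in Scala (the toy (id, vector) dataset and all names are illustrative; note I compute the distance with a plain map, since reduceByKey only comes in once you aggregate distances per point):

import org.apache.spark.{SparkConf, SparkContext}

object PairwiseDistances {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("pairwise").setMaster("local[*]"))

    // Step 1: an RDD of (id, feature vector) pairs.
    val points = sc.parallelize(Seq(
      (1L, Array(0.0, 0.0)),
      (2L, Array(3.0, 4.0)),
      (3L, Array(6.0, 8.0))))

    // Step 2: cross join (cartesian) to generate all pairs.
    // Keeping only i < j skips self-pairs and exploits the symmetry
    // you mention, halving the n^2 work.
    val pairs = points.cartesian(points)
      .filter { case ((i, _), (j, _)) => i < j }

    // Step 3: compute the Euclidean distance per pair.
    val distances = pairs.map { case ((i, a), (j, b)) =>
      val d = math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
      ((i, j), d)
    }

    distances.collect().foreach(println)
    sc.stop()
  }
}

Be aware that cartesian shuffles on the order of n^2 records across the cluster, so this approach only scales so far.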
Best
Ayan

On 30 Apr 2015 06:11, "Driesprong, Fokko" <fo...@driesprong.frl> wrote:

> Dear Sparkers,
>
> I am working on an algorithm which requires the pairwise distance between
> all points (e.g. DBSCAN, LOF, etc.). Computing this for *n* points will
> produce an n^2 matrix. If the distance measure is symmetric, this can be
> reduced to (n^2)/2. What would be the optimal way of computing this?
>
> The paper *Pairwise Element Computation with MapReduce
> <https://www.cs.ucsb.edu/~ravenben/classes/290F/papers/kvl10.pdf>*
> describes different approaches to optimizing this process within a
> map-reduce model, although I don't believe these are directly applicable
> to Spark. How would you guys approach this?
>
> I first thought about broadcasting the original points to all the workers
> and then computing the distances across the different workers. This
> requires all the points to be distributed across all the machines, though,
> which feels rather brute-force. What do you guys think?
>
> I don't expect full solutions, but some pointers would be great. I think a
> good solution can be reused for many algorithms.
>
> Kind regards,
> Fokko Driesprong
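For comparison, here is a minimal sketch of the broadcast idea from the quoted message, assuming the full point set fits in driver and executor memory (again, all names are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastDistances {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("broadcast-pairs").setMaster("local[*]"))

    val points = sc.parallelize(Seq(
      (1L, Array(0.0, 0.0)),
      (2L, Array(3.0, 4.0)),
      (3L, Array(6.0, 8.0))))

    // Ship one read-only copy of all points to every executor instead of
    // shuffling n^2 pairs across the cluster.
    val all = sc.broadcast(points.collect())

    // Each point is compared locally against the broadcast copy; the
    // i < j filter again exploits symmetry and skips self-pairs.
    val distances = points.flatMap { case (i, a) =>
      all.value.collect { case (j, b) if i < j =>
        val d = math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
        ((i, j), d)
      }
    }

    distances.collect().foreach(println)
    sc.stop()
  }
}

This avoids the cartesian shuffle, but it does require collecting the whole dataset to the driver first, which is the brute-force aspect the quoted message mentions.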