Dear Sparkers,

I am working on an algorithm which requires the pairwise distances between all points (e.g. DBSCAN, LOF, etc.). Computing this for *n* points produces an n x n matrix. If the distance measure is symmetric, this can be reduced to roughly (n^2)/2 entries, since only the pairs with i < j are needed. What would be the most efficient way of computing this?
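To make the symmetry point concrete, this is roughly what I mean on a single machine (plain Scala for now, with a Euclidean distance that is only an example):

// Plain Scala, not Spark yet; Euclidean distance just as an example.
def euclidean(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

// Only the pairs with i < j are computed, giving n*(n-1)/2 distances
// instead of the full n^2 matrix.
def upperTriangle(points: Array[Array[Double]]): Seq[((Int, Int), Double)] =
  for {
    i <- points.indices
    j <- (i + 1) until points.length
  } yield ((i, j), euclidean(points(i), points(j)))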
The paper *Pairwise Element Computation with MapReduce* <https://www.cs.ucsb.edu/~ravenben/classes/290F/papers/kvl10.pdf> describes several approaches to optimizing this within a MapReduce model, although I am not sure how well they carry over to Spark. How would you approach this?

My first thought was to broadcast the original points to all the workers and then compute the distances across the different workers. However, that requires the full set of points to be replicated on every machine, which feels rather brute-force; a rough sketch of what I mean is included below my signature. What do you think?

I don't expect full solutions, but some pointers would be great. I think a good solution could be reused for many algorithms.

Kind regards,
Fokko Driesprong
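Here is the minimal sketch of the brute-force broadcast idea I had in mind, assuming the RDD API, a Euclidean distance, and a point set small enough to broadcast to every executor; the hard-coded points and the output path are only placeholders:

import org.apache.spark.{SparkConf, SparkContext}

object PairwiseDistances {

  // Euclidean distance, again only as an example; any symmetric measure works.
  def euclidean(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pairwise-distances"))

    // Small hard-coded point set just to keep the sketch self-contained;
    // the assumption is that the full set fits in memory on every executor.
    val points: Array[Array[Double]] = Array(
      Array(0.0, 0.0), Array(1.0, 0.0), Array(0.0, 1.0), Array(3.0, 4.0))
    val broadcastPoints = sc.broadcast(points)
    val n = points.length

    // Distribute only the row indices; each task looks the vectors up in the
    // broadcast copy. Keeping j > i exploits the symmetry, so only about
    // n^2/2 distances are materialised instead of the full n^2 matrix.
    val distances = sc
      .parallelize(0 until n)
      .flatMap { i =>
        val pts = broadcastPoints.value
        ((i + 1) until n).map(j => ((i, j), euclidean(pts(i), pts(j))))
      }

    distances.saveAsTextFile("pairwise-distances") // or feed into DBSCAN / LOF
    sc.stop()
  }
}

Only the row indices get shuffled around; whether this still works once the point set is too large to broadcast is exactly the part I am unsure about.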