Here is my first thought; please suggest any further improvements:
1. Create an RDD of your dataset.
2. Do a cross join (cartesian) to generate all pairs.
3. Compute the distance for each pair and apply reduceByKey. You will get
an RDD of key pairs and distances (a sketch follows below).
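
A minimal sketch in Scala of what I mean, assuming Euclidean distance and
a toy keyed dataset (all names are illustrative; I also filter on the ids
instead of reducing, so only one of each symmetric pair survives):

    import org.apache.spark.{SparkConf, SparkContext}

    object PairwiseDistances {
      // Plain Euclidean distance between two feature vectors.
      def distance(a: Array[Double], b: Array[Double]): Double =
        math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("pairwise"))

        // Step 1: an RDD of (id, feature vector) records.
        val points = sc.parallelize(Seq(
          (0L, Array(0.0, 0.0)),
          (1L, Array(3.0, 4.0)),
          (2L, Array(6.0, 8.0))
        ))

        // Step 2: cartesian generates all n^2 pairs; keeping only i < j
        // exploits symmetry and drops self-pairs, leaving n(n-1)/2.
        // Step 3: compute the distance per pair, keyed by the id pair.
        val distances = points.cartesian(points)
          .filter { case ((i, _), (j, _)) => i < j }
          .map { case ((i, a), (j, b)) => ((i, j), distance(a, b)) }

        distances.collect().foreach(println)
        sc.stop()
      }
    }

Bear in mind that cartesian still shuffles all n^2 pairs, so this only
scales so far.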

Best
Ayan
On 30 Apr 2015 06:11, "Driesprong, Fokko" <fo...@driesprong.frl> wrote:

> Dear Sparkers,
>
> I am working on an algorithm which requires the pairwise distance between
> all points (e.g. DBSCAN, LOF, etc.). Computing this for *n* points will
> produce an n^2 matrix. If the distance measure is symmetric, this can be
> reduced to n(n-1)/2 distinct pairs. What would be the most efficient way
> of computing this?
>
> The paper *Pairwise Element Computation with MapReduce
> <https://www.cs.ucsb.edu/~ravenben/classes/290F/papers/kvl10.pdf>*
> describes different approaches to optimizing this process within the
> MapReduce model, although I am not sure these translate directly to
> Spark. How would you guys approach this?
>
> I first thought about broadcasting the original points to all the workers
> and then computing the distances across the different workers. This
> requires a full copy of the points on every machine, though, which feels
> rather brute-force. What do you guys think?
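>
> A rough sketch of what I have in mind (assuming a SparkContext sc, the
> (id, vector) RDD points, and a distance helper like the one sketched
> above; all names are hypothetical):
>
>     // Ship one full copy of the (small enough) dataset to every executor.
>     val localPoints: Array[(Long, Array[Double])] = points.collect()
>     val bcast = sc.broadcast(localPoints)
>
>     // Each worker compares its slice of points against the broadcast
>     // copy; i < j again keeps only one of each symmetric pair.
>     val distances = points.flatMap { case (i, a) =>
>       bcast.value.collect { case (j, b) if i < j => ((i, j), distance(a, b)) }
>     }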
>
> I don't expect full solutions, but some pointers would be great. I think
> a good solution could be reused for many algorithms.
>
> Kind regards,
> Fokko Driesprong
>
