Cross Join shuffle space might not be needed since most likely through application specific logic (topK etc) you can cut the shuffle space...Also most likely the brute force approach will be a benchmark tool to see how better is your clustering based KNN solution since there are several ways you can find approximate nearest neighbors for your application (KMeans/KDTree/LSH etc)...
There is a variant that I will bring as a PR for this JIRA and we will of course look into how to improve it further...the idea is to think about distributed matrix multiply where both matrices A and B are distributed and master coordinates pulling a partition of A and multiply it with B... The idea suffices for kernel matrix generation as well if the number of rows are modest (~10M or so)... https://issues.apache.org/jira/browse/SPARK-4823 On Wed, Apr 29, 2015 at 3:25 PM, ayan guha <guha.a...@gmail.com> wrote: > This is my first thought, please suggest any further improvement: > 1. Create a rdd of your dataset > 2. Do an cross join to generate pairs > 3. Apply reducebykey and compute distance. You will get a rdd with > keypairs and distance > > Best > Ayan > On 30 Apr 2015 06:11, "Driesprong, Fokko" <fo...@driesprong.frl> wrote: > >> Dear Sparkers, >> >> I am working on an algorithm which requires the pair distance between all >> points (eg. DBScan, LOF, etc.). Computing this for *n* points will >> require produce a n^2 matrix. If the distance measure is symmetrical, this >> can be reduced to (n^2)/2. What would be the most optimal way of computing >> this? >> >> The paper *Pairwise Element Computation with MapReduce >> <https://www.cs.ucsb.edu/~ravenben/classes/290F/papers/kvl10.pdf>* paper >> describes different approaches to optimize this process within a map-reduce >> model. Although I don't believe this is applicable to Spark. How would you >> guys approach this? >> >> I first thought about broadcasting the original points to all the >> workers, and then compute the distances across the different workers. >> Although this requires all the points to be distributed across all the >> machines. But this feels rather brute-force, what do you guys think. >> >> I don't expect full solutions, but some pointers would be great. I think >> a good solution can be re-used for many algorithms. >> >> Kind regards, >> Fokko Driesprong >> >