Re: Compute pairwise distance

Debasish Das Wed, 29 Apr 2015 19:41:47 -0700

Cross Join shuffle space might not be needed since most likely through
application specific logic (topK etc) you can cut the shuffle space...Also
most likely the brute force approach will be a benchmark tool to see how
better is your clustering based KNN solution since there are several ways
you can find approximate nearest neighbors for your application
(KMeans/KDTree/LSH etc)...


There is a variant that I will bring as a PR for this JIRA and we will of
course look into how to improve it further...the idea is to think about
distributed matrix multiply where both matrices A and B are distributed and
master coordinates pulling a partition of A and multiply it with B...

The idea suffices for kernel matrix generation as well if the number of
rows are modest (~10M or so)...

https://issues.apache.org/jira/browse/SPARK-4823


On Wed, Apr 29, 2015 at 3:25 PM, ayan guha <[email protected]> wrote:

> This is my first thought, please suggest any further improvement:
> 1. Create a rdd of your dataset
> 2. Do an cross join to generate pairs
> 3. Apply reducebykey and compute distance. You will get a rdd with
> keypairs and distance
>
> Best
> Ayan
> On 30 Apr 2015 06:11, "Driesprong, Fokko" <[email protected]> wrote:
>
>> Dear Sparkers,
>>
>> I am working on an algorithm which requires the pair distance between all
>> points (eg. DBScan, LOF, etc.). Computing this for *n* points will
>> require produce a n^2 matrix. If the distance measure is symmetrical, this
>> can be reduced to (n^2)/2. What would be the most optimal way of computing
>> this?
>>
>> The paper *Pairwise Element Computation with MapReduce
>> <https://www.cs.ucsb.edu/~ravenben/classes/290F/papers/kvl10.pdf>* paper
>> describes different approaches to optimize this process within a map-reduce
>> model. Although I don't believe this is applicable to Spark. How would you
>> guys approach this?
>>
>> I first thought about broadcasting the original points to all the
>> workers, and then compute the distances across the different workers.
>> Although this requires all the points to be distributed across all the
>> machines. But this feels rather brute-force, what do you guys think.
>>
>> I don't expect full solutions, but some pointers would be great. I think
>> a good solution can be re-used for many algorithms.
>>
>> Kind regards,
>> Fokko Driesprong
>>
>

Re: Compute pairwise distance

Reply via email to