If you don't want to compute all N^2 similarities, you need to implement
some kind of blocking first, for example LSH (locality-sensitive hashing).
A quick search gave this link to a Spark implementation:

http://stackoverflow.com/questions/27718888/spark-implementation-for-locality-sensitive-hashing
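
In case it helps before you dig into the link, here is a minimal sketch of
the idea on plain RDDs in Scala. This is not the implementation from the
link; the names, the cosine similarity, and the single 16-bit hash table
are all illustrative assumptions. The point is just: hash every vector
with a few random hyperplanes and only compute similarities for vectors
that land in the same bucket.

import org.apache.spark.rdd.RDD

import scala.util.Random

object LshBlocking {

  type Vec = Array[Double]

  // Cosine similarity between two dense vectors.
  def cosine(a: Vec, b: Vec): Double = {
    val dot   = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // Signature = sign pattern of the vector projected onto `planes`; vectors at a
  // small angle from each other tend to get the same signature.
  def signature(planes: Array[Vec], v: Vec): String =
    planes.map(p => if (p.zip(v).map { case (x, y) => x * y }.sum >= 0) '1' else '0').mkString

  // Emit (idA, idB, similarity) only for pairs that collide in the same bucket,
  // instead of over the full N^2 cartesian product.
  def candidatePairs(points: RDD[(Long, Vec)], dim: Int,
                     numBits: Int = 16, seed: Long = 42L): RDD[(Long, Long, Double)] = {
    val rng    = new Random(seed)
    val planes = Array.fill(numBits)(Array.fill(dim)(rng.nextGaussian()))

    points
      .map { case (id, v) => (signature(planes, v), (id, v)) }
      .groupByKey()                        // blocking step: one group per hash bucket
      .flatMap { case (_, bucket) =>
        val pts = bucket.toArray
        for {
          i <- pts.indices.iterator
          j <- (i + 1) until pts.length
        } yield (pts(i)._1, pts(j)._1, cosine(pts(i)._2, pts(j)._2))
      }
  }
}

From candidatePairs you can then keep the k most similar entries per id to
get the kNN edges. One table with a long signature will miss some true
neighbors, so real LSH setups repeat this with several independent hash
tables (or shorter signatures) and union the candidate pairs.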

On Wed, Aug 26, 2015 at 7:35 AM, Jaonary Rabarisoa <jaon...@gmail.com>
wrote:

> Dear all,
>
> I'm trying to find an efficient way to build a k-NN graph for a large
> dataset. Precisely, I have a large set of high-dimensional vectors (say d
> >>> 10000) and I want to build a graph where those high-dimensional points
> are the vertices and each one is linked to its k nearest neighbors based on
> some kind of similarity defined on the vertex space.
> My problem is to implement an efficient algorithm to compute the weight
> matrix of the graph. I need to compute N*N similarities, and the only way
> I know is to use a "cartesian" operation followed by a "map" operation on
> RDDs. But this is very slow when N is large. Is there a cleverer way to
> do this for an arbitrary similarity function?
>
> Cheers,
>
> Jao
>
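
For reference, the brute-force version described in the quoted mail (a
"cartesian" followed by "map", then a per-vertex top-k) looks roughly like
the sketch below; sim and the other names are hypothetical placeholders,
and the shuffle it triggers is quadratic in N, which is exactly what the
blocking above avoids:

import org.apache.spark.rdd.RDD

object BruteForceKnn {

  type Vec = Array[Double]

  // Placeholder similarity (a plain dot product); plug in whatever the
  // application actually needs.
  def sim(a: Vec, b: Vec): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // For every vertex id, keep the ids and weights of its k most similar neighbors.
  def knnEdges(points: RDD[(Long, Vec)], k: Int): RDD[(Long, Seq[(Long, Double)])] =
    points.cartesian(points)                              // O(N^2) pairs
      .filter { case ((i, _), (j, _)) => i != j }
      .map { case ((i, a), (j, b)) => (i, (j, sim(a, b))) }
      .groupByKey()
      .mapValues(_.toSeq.sortBy { case (_, s) => -s }.take(k))
}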
