+1 to all of the above esp.  Dimensionality reduction and locality sensitive 
hashing / min hashing. 


There's also an algorithm implemented in MLlib called DIMSUM which was 
developed at Twitter for this purpose. I've been meaning to try it and would be 
interested to hear about results you get. 





https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum













​Charlie 







—
Sent from Mailbox




On Wednesday, Aug 26, 2015 at 09:57, Michael Malak 
<michaelma...@yahoo.com.invalid>, wrote:


Yes. And a paper that describes using grids (actually varying grids) is 
http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CCVPR12-GraphConstruction.pdf
 In the Spark GraphX In Action book that Robin East and I are writing, we 
implement a drastically simplified version of this in chapter 7, which should 
become available in the MEAP mid-September. 
http://www.manning.com/books/spark-graphx-in-action






   
 


If you don't want to compute all N^2 similarities, you need to implement some 
kind of blocking first. For example, LSH (locally sensitive hashing). A quick 
search gave this link to a Spark implementation: 

http://stackoverflow.com/questions/27718888/spark-implementation-for-locality-sensitive-hashing












On Wed, Aug 26, 2015 at 7:35 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:

Dear all,


I'm trying to find an efficient way to build a k-NN graph for a large dataset. 
Precisely, I have a large set of high dimensional vector (say d >>> 10000) and 
I want to build a graph where those high dimensional points are the vertices 
and each one is linked to the k-nearest neighbor based on some kind similarity 
defined on the vertex spaces. 
My problem is to implement an efficient algorithm to compute the weight matrix 
of the graph. I need to compute a N*N similarities and the only way I know is 
to use "cartesian" operation follow by "map" operation on RDD. But, this is 
very slow when the N is large. Is there a more cleaver way to do this for an 
arbitrary similarity function ? 




Cheers,




Jao

Reply via email to