Re: Computing hamming distance over large data set

2016-02-12 Thread Charlie Hack
I ran across DIMSUM a while ago but never used it.

https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html

Annoy is wonderful if you want to make queries. If you want to do the "self similarity join" you might look at DIMSUM or preferably if at all possibl

Re: Computing hamming distance over large data set

2016-02-12 Thread Maciej Szymkiewicz
There is also this: https://github.com/soundcloud/cosine-lsh-join-spark

On 02/11/2016 10:12 PM, Brian Morton wrote:
> Karl,
>
> This is tremendously useful. Thanks very much for your insight.
>
> Brian
>
> On Thu, Feb 11, 2016 at 12:58 PM, Karl Higley
> wrote:
> >

Re: Computing hamming distance over large data set

2016-02-11 Thread Brian Morton
Karl,

This is tremendously useful. Thanks very much for your insight.

Brian

On Thu, Feb 11, 2016 at 12:58 PM, Karl Higley wrote:
> Hi,
>
> It sounds like you're trying to solve the approximate nearest neighbor
> (ANN) problem. With a large dataset, parallelizing a brute force O(n^2)
> approac

Re: Computing hamming distance over large data set

2016-02-11 Thread Karl Higley
Hi,

It sounds like you're trying to solve the approximate nearest neighbor (ANN) problem. With a large dataset, parallelizing a brute force O(n^2) approach isn't likely to help all that much, because the number of pairwise comparisons grows quickly as the size of the dataset increases. I'd look at
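
To make the point about pairwise growth concrete, here is a minimal Python sketch of the brute-force approach described in this message. It is not code from the thread: the function names (hamming, brute_force_pairs, pair_count) and the choice to encode each bit vector as an integer are assumptions made for illustration only.

```python
from itertools import combinations


def hamming(a: int, b: int) -> int:
    """Hamming distance between two bit vectors encoded as integers."""
    # XOR leaves a 1 in every position where the two vectors differ.
    return bin(a ^ b).count("1")


def brute_force_pairs(codes, max_dist):
    """Return index pairs (i, j), i < j, whose distance is <= max_dist.

    Every pair is compared, so this does n * (n - 1) / 2 distance
    computations for n input codes.
    """
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(codes), 2)
        if hamming(a, b) <= max_dist
    ]


def pair_count(n: int) -> int:
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    codes = [0b0000, 0b0001, 0b1111]
    print(brute_force_pairs(codes, max_dist=1))
    # The comparison count grows quadratically with the dataset size.
    for n in (1_000, 1_000_000):
        print(n, pair_count(n))
```

Going from a thousand to a million records multiplies the number of comparisons by roughly a million, so spreading the same loop over more machines only divides a quadratically growing workload by a constant. That is the motivation for approximate methods that avoid comparing most pairs at all.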