from:"Karl Higley"

Re: Selecting the top 100 records per group by?

2016-09-10 Thread Karl Higley

Would `topByKey` help? https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42 Best, Karl On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton wrote: > I'm trying to figure out a way to group by and return the top 100 records > in that g
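For readers unfamiliar with what a top-N-per-key operation computes, here is a minimal sketch on plain Scala collections. It is not Spark's implementation. The object name `TopPerGroup` and the function `topByKeyLocal` are hypothetical, and a bounded heap per key is only one possible way to do it.

```scala
import scala.collection.mutable

object TopPerGroup {
  // Keep the `num` largest values for each key, using one bounded heap per key.
  // Hypothetical local stand-in; it does not call Spark.
  def topByKeyLocal[K, V](records: Iterable[(K, V)], num: Int)(
      implicit ord: Ordering[V]): Map[K, List[V]] = {
    val heaps = mutable.Map.empty[K, mutable.PriorityQueue[V]]
    for ((key, value) <- records) {
      // With the ordering reversed, the smallest retained value sits at the head.
      val heap = heaps.getOrElseUpdate(key, mutable.PriorityQueue.empty[V](ord.reverse))
      heap.enqueue(value)
      if (heap.size > num) heap.dequeue() // evict the current minimum
    }
    // Return each group's values sorted from largest to smallest.
    heaps.map { case (key, heap) => key -> heap.toList.sorted(ord.reverse) }.toMap
  }
}
```

Each key holds at most `num` values at a time, so memory stays bounded even when a group is very large.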

Re: Locality sensitive hashing

2016-07-24 Thread Karl Higley

Hi Janardhan, I collected some LSH papers while working on an RDD-based implementation. Links at the end of the README here: https://github.com/karlhigley/spark-neighbors Keep me posted on what you come up with! Best, Karl On Sun, Jul 24, 2016 at 9:54 AM janardhan shetty wrote: > I was lookin

Re: How to recommend most similar users using Spark ML

2016-07-17 Thread Karl Higley

There are also some Spark packages for finding approximate nearest neighbors using locality sensitive hashing: https://spark-packages.org/?q=tags%3Alsh On Fri, Jul 15, 2016 at 7:45 AM nguyen duc Tuan wrote: > Hi jeremycod, > If you want to find top N nearest neighbors for all users using exact >

Re: Compute

2016-04-27 Thread Karl Higley

for ugly title of email. I forgot to check it before sending. > > 2016-04-28 10:10 GMT+07:00 Karl Higley : > >> One idea is to avoid materializing the pairs of points before computing >> the distances between them. You could do that using the LSH signatures by >> buil

Re: Compute

2016-04-27 Thread Karl Higley

One idea is to avoid materializing the pairs of points before computing the distances between them. You could do that using the LSH signatures by building (Signature, (Int, Vector)) tuples, grouping by signature, and then iterating pairwise over the resulting lists of points to compute the distance
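The grouping idea described above could look roughly like this on plain Scala collections. This is a sketch under assumptions: `Array[Double]` stands in for a vector type, the `String` signature is a placeholder, Euclidean distance is an arbitrary choice, and the names `BucketDistances` and `withinBucketDistances` are hypothetical.

```scala
object BucketDistances {
  // (id, vector); Array[Double] is a stand-in for a real vector type.
  type Point = (Int, Array[Double])

  def euclidean(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // `signed` holds (signature, (id, vector)) tuples.
  // Only points sharing a signature are compared with each other.
  def withinBucketDistances(signed: Seq[(String, Point)]): Seq[(Int, Int, Double)] =
    signed
      .groupBy { case (signature, _) => signature }
      .values
      .flatMap { bucket =>
        val points = bucket.map(_._2).toIndexedSeq
        // Iterate pairwise over the bucket without building a global pair list.
        for {
          i <- points.indices
          j <- (i + 1) until points.size
        } yield (points(i)._1, points(j)._1, euclidean(points(i)._2, points(j)._2))
      }
      .toSeq
}
```

Pairs are generated only inside each bucket, so the full cross product of points is never materialized.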

Re: Reindexing in graphx

2016-02-25 Thread Karl Higley

For real time graph mutations and queries, you might consider a graph database like Neo4j or TitanDB. Titan can be backed by HBase, which you're already using, so that's probably worth a look. On Thu, Feb 25, 2016, 9:55 AM Udbhav Agarwal wrote: > That’s a good thing you pointed out. Let me check

Re: Computing hamming distance over large data set

2016-02-11 Thread Karl Higley

Hi, It sounds like you're trying to solve the approximate nearest neighbor (ANN) problem. With a large dataset, parallelizing a brute force O(n^2) approach isn't likely to help all that much, because the number of pairwise comparisons grows quickly as the size of the dataset increases. I'd look at
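To make the growth in pairwise comparisons concrete, here is a small sketch. It assumes the items are 64-bit binary codes held in a `Long`, which the thread does not state, and the names `HammingBruteForce`, `hamming`, and `pairCount` are hypothetical.

```scala
object HammingBruteForce {
  // Hamming distance between two 64-bit codes: count the differing bits.
  def hamming(a: Long, b: Long): Int = java.lang.Long.bitCount(a ^ b)

  // Number of unordered pairs a brute-force scan has to compare: n * (n - 1) / 2.
  def pairCount(n: Long): Long = n * (n - 1) / 2
}
```

For example, `pairCount(1000L)` is 499500 while `pairCount(1000000L)` is 499999500000, so a thousandfold increase in data means roughly a millionfold increase in comparisons.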

Re: Product similarity with TF/IDF and Cosine similarity (DIMSUM)

2016-02-03 Thread Karl Higley

Hi Alan, I'm slow responding, so you may have already figured this out. Just in case, though:

    val approx = mat.columnSimilarities(0.1)
    approxEntries.first()
    res18: ((Long, Long), Double) = ((1638,966248),0.632455532033676)

The above is returning the cosine similarity between columns 1638 and
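As a point of reference for what an entry of the form `((i, j), similarity)` means, here is a minimal exact computation on plain Scala collections. It is a sketch, not Spark's code: `columnSimilarities` with a threshold is an approximate, sampling-based method, whereas this computes every pair exactly. The names `ColumnCosine` and `exactColumnSimilarities` are hypothetical.

```scala
object ColumnCosine {
  // Cosine similarity of two equal-length vectors.
  // Returns NaN if either vector is all zeros.
  def cosine(a: Seq[Double], b: Seq[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }

  // Exact similarity for every column pair (i, j) with i < j,
  // keyed as ((Long, Long), Double) like the REPL output above.
  def exactColumnSimilarities(rows: Seq[Seq[Double]]): Map[(Long, Long), Double] = {
    val cols = rows.transpose
    val entries = for {
      i <- cols.indices
      j <- (i + 1) until cols.size
    } yield ((i.toLong, j.toLong), cosine(cols(i), cols(j)))
    entries.toMap
  }
}
```

The key of each entry is a pair of column indices, not row indices, which matches the reading given in the reply.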

Re: Spark : merging object with approximation

2015-11-25 Thread Karl Higley

Hi, What merge behavior do you want when A~=B, B~=C but A!=C? Should the merge emit ABC? AB and BC? Something else? Best, Karl On Sat, Nov 21, 2015 at 5:24 AM OcterA wrote: > Hello, > > I have a set of X data (around 30M entry), I have to do a batch to merge > data which are similar, at the en

9 matches
