Would `topByKey` help?
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42
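For a rough picture of what that gives you: `topByKey(num)` becomes available on a pair RDD after importing `org.apache.spark.mllib.rdd.MLPairRDDFunctions._`, and it returns the largest `num` values for each key. The sketch below mimics that result on plain Scala collections so it runs without a cluster. `topNByKey` and the sample data are hypothetical names for illustration, not Spark's implementation.

```scala
// Plain-Scala sketch of "top N values per key".
// topNByKey is a hypothetical helper, not part of Spark.
def topNByKey[K, V](records: Seq[(K, V)], n: Int)(implicit ord: Ordering[V]): Map[K, Seq[V]] =
  records
    .groupBy { case (key, _) => key }
    .map { case (key, group) =>
      // Sort each group's values in descending order and keep the first n.
      key -> group.map(_._2).sorted(ord.reverse).take(n)
    }

// Hypothetical sample data: (key, score) pairs.
val scores = Seq(("a", 3), ("a", 9), ("a", 5), ("b", 1), ("b", 7))
val top2 = topNByKey(scores, 2)
```

Sorting whole groups is fine for a sketch. At scale you would probably want a bounded per-key structure so that no group has to be fully materialized.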
Best,
Karl
On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton wrote:
> I'm trying to figure out a way to group by and return the top 100 records
> in that g
Hi Janardhan,
I collected some LSH papers while working on an RDD-based implementation.
Links at the end of the README here:
https://github.com/karlhigley/spark-neighbors
Keep me posted on what you come up with!
Best,
Karl
On Sun, Jul 24, 2016 at 9:54 AM janardhan shetty wrote:
> I was lookin
There are also some Spark packages for finding approximate nearest
neighbors using locality sensitive hashing:
https://spark-packages.org/?q=tags%3Alsh
On Fri, Jul 15, 2016 at 7:45 AM nguyen duc Tuan wrote:
> Hi jeremycod,
> If you want to find top N nearest neighbors for all users using exact
>
for the ugly title of the email. I forgot to check it before sending.
>
> 2016-04-28 10:10 GMT+07:00 Karl Higley:
>
>> One idea is to avoid materializing the pairs of points before computing
>> the distances between them. You could do that using the LSH signatures by
>> buil
One idea is to avoid materializing the pairs of points before computing the
distances between them. You could do that using the LSH signatures by
building (Signature, (Int, Vector)) tuples, grouping by signature, and then
iterating pairwise over the resulting lists of points to compute the
distance
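The grouping idea above (build signature-keyed tuples, group by signature, compare only within each group) can be sketched on plain Scala collections. This is a minimal illustration under stated assumptions: the sign-random-projection signature, the fixed `hyperplanes`, the Euclidean distance, and all function names are hypothetical choices, not taken from the thread.

```scala
// Hypothetical hyperplanes; a real LSH scheme would draw these at random.
val hyperplanes = Seq(Array(1.0, 0.0), Array(0.0, 1.0))

// One bit per hyperplane: which side of the plane the vector falls on.
def signature(v: Array[Double]): Seq[Boolean] =
  hyperplanes.map(h => h.zip(v).map { case (a, b) => a * b }.sum >= 0.0)

def euclidean(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

// Build (signature, (id, vector)) tuples, group by signature, and compute
// distances only between points that share a bucket.
def candidateDistances(points: Seq[(Int, Array[Double])]): Seq[((Int, Int), Double)] =
  points
    .map(p => (signature(p._2), p))
    .groupBy(_._1)
    .values
    .flatMap { bucket =>
      bucket.map(_._2).combinations(2).map {
        case Seq((i, u), (j, v)) => ((i, j), euclidean(u, v))
      }
    }
    .toSeq

// Hypothetical sample points: ids 0 and 1 share a signature, id 2 does not.
val points = Seq(
  (0, Array(1.0, 1.0)), (1, Array(4.0, 5.0)), (2, Array(-1.0, 2.0))
)
val distances = candidateDistances(points)
```

Only the pair (0, 1) is compared here, because point 2 lands in a different bucket. The full set of pairs is never built.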
For real time graph mutations and queries, you might consider a graph
database like Neo4j or TitanDB. Titan can be backed by HBase, which you're
already using, so that's probably worth a look.
On Thu, Feb 25, 2016, 9:55 AM Udbhav Agarwal wrote:
> That’s a good thing you pointed out. Let me check
Hi,
It sounds like you're trying to solve the approximate nearest neighbor
(ANN) problem. With a large dataset, parallelizing a brute force O(n^2)
approach isn't likely to help all that much, because the number of pairwise
comparisons grows quickly as the size of the dataset increases. I'd look at
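To put numbers on how fast the brute-force comparison count grows, here is a small sketch. The dataset sizes are hypothetical, and `pairCount` is just an illustrative name for the usual n(n-1)/2 count of unordered pairs.

```scala
// Number of unordered pairs among n points: n * (n - 1) / 2.
def pairCount(n: Long): Long = n * (n - 1) / 2

// Hypothetical dataset sizes.
val sizes = Seq(1000L, 1000000L)
val counts = sizes.map(n => n -> pairCount(n))
```

Going from a thousand points to a million multiplies the number of comparisons by roughly a million, which is more than adding machines is likely to absorb.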
Hi Alan,
I'm slow responding, so you may have already figured this out. Just in
case, though:
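import org.apache.spark.mllib.linalg.distributed.MatrixEntry
val approx = mat.columnSimilarities(0.1)
// assumed definition of approxEntries, inferred from the result type below
val approxEntries = approx.entries.map { case MatrixEntry(i, j, v) => ((i, j), v) }
approxEntries.first()
res18: ((Long, Long), Double) = ((1638,966248),0.632455532033676)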
The above is returning the cosine similarity between columns 1638 and
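For reference, cosine similarity between two columns is their dot product divided by the product of their norms. The sketch below computes it exactly on plain arrays. `cosine`, `colA`, and `colB` are hypothetical names and data, not the matrix from this thread, and `columnSimilarities` with a threshold returns an approximation of this quantity rather than the exact value.

```scala
// Cosine similarity between two equal-length columns:
// dot(a, b) / (norm(a) * norm(b)).
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}

// Hypothetical columns sharing one nonzero row out of two each.
val colA = Array(1.0, 0.0, 1.0)
val colB = Array(1.0, 1.0, 0.0)
val similarity = cosine(colA, colB)
```

For these two columns the similarity comes out at about 0.5.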
Hi,
What merge behavior do you want when A~=B, B~=C but A!=C? Should the merge
emit ABC? AB and BC? Something else?
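To make the two options concrete, here is a small sketch on plain Scala collections. The sample pairs and the `transitive` helper are hypothetical. The first option emits each similar pair as its own merge; the second treats similarity as graph edges and emits connected components.

```scala
// Hypothetical similarity pairs: A~=B and B~=C, with no A~=C pair.
val similar = Seq(("A", "B"), ("B", "C"))

// Option 1: emit each similar pair as its own merge (AB and BC).
val pairwise: Seq[Set[String]] = similar.map { case (x, y) => Set(x, y) }

// Option 2: merge transitively (ABC) by folding each pair into the running
// groups and fusing every group that the pair touches.
def transitive(pairs: Seq[(String, String)]): Set[Set[String]] =
  pairs.foldLeft(Set.empty[Set[String]]) { case (groups, (x, y)) =>
    val (touched, untouched) = groups.partition(g => g(x) || g(y))
    untouched + (touched.flatten + x + y)
  }

val merged = transitive(similar)
```

With the transitive option, A and C end up in the same record even though they were never judged similar to each other, so the choice depends on whether that is acceptable for your data.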
Best,
Karl
On Sat, Nov 21, 2015 at 5:24 AM OcterA wrote:
> Hello,
>
> I have a set of X data (around 30M entries). I have to run a batch to merge
> data which are similar, at the en