I do a self-join. I tried to cache the transformed dataset before joining,
but it didn't help either.
2017-02-23 13:25 GMT+07:00 Nick Pentreath:
And to be clear, are you doing a self-join for approx similarity? Or
joining to another dataset?
On Thu, 23 Feb 2017 at 02:01, nguyen duc Tuan wrote:
Hi Seth,
Here are the parameters that I used in my experiments:
- Number of executors: 16
- Executor memory: varied from 1G -> 2G -> 3G
- Number of cores per executor: 1 -> 2
- Driver memory: varied from 1G -> 2G -> 3G
- Similarity threshold: 0.6
MinHash:
- Number of hash tables: 2
SignedRandomProjection:
-
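In Spark 2.1 ML terms, the MinHash side of this setup looks roughly like
the following (a sketch: the DataFrame df and the column names are
assumptions, and SignedRandomProjection has no counterpart in Spark ML):

    import org.apache.spark.ml.feature.MinHashLSH

    // df is assumed to have a sparse vector column "features".
    val mh = new MinHashLSH()
      .setNumHashTables(2)      // "Number of hash tables: 2" above
      .setInputCol("features")
      .setOutputCol("hashes")

    val model = mh.fit(df)
    val transformed = model.transform(df).cache()  // cache before self-join

    // Self-join at the 0.6 threshold; output carries a "distCol" column.
    val pairs = model.approxSimilarityJoin(transformed, transformed, 0.6)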
I'm looking into this a bit further, thanks for bringing it up! Right now
the LSH implementation only uses OR-amplification. The practical
consequence of this is that it will select too many candidates when doing
approximate near neighbor search and approximate similarity join. When we
add AND-amplification
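To make the distinction concrete, here is a toy sketch (not the Spark
internals): with several hash tables, OR-amplification treats a pair as
candidates if their hashes collide in at least one table, while
AND-amplification requires a collision in every table:

    // hashesA(i) and hashesB(i) are the hash values of two points in table i.
    def orCandidate(hashesA: Seq[Int], hashesB: Seq[Int]): Boolean =
      hashesA.zip(hashesB).exists { case (a, b) => a == b }  // any table collides

    def andCandidate(hashesA: Seq[Int], hashesB: Seq[Int]): Boolean =
      hashesA.zip(hashesB).forall { case (a, b) => a == b }  // all tables collide

OR-amplification raises the chance of catching true neighbors but admits
many false candidates, which is why the join output blows up;
AND-amplification prunes far more aggressively.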
The original Uber authors provided this performance test result:
https://docs.google.com/document/d/19BXg-67U83NVB3M0I84HVBVg3baAVaESD_mrg_-vLro
This was for MinHash only though, so it's not clear what the scalability
is like for the other metric types.
The SignRandomProjectionLSH is not yet in Spark.
In the end, I switched back to the LSH implementation that I used before
(https://github.com/karlhigley/spark-neighbors). I can run it on my dataset
now. If anyone has a suggestion, please tell me.
Thanks.
2017-02-12 9:25 GMT+07:00 nguyen duc Tuan:
> Hi Timur,
> 1) Our data is transformed to datase
Hello,
1) Are you sure that your data is "clean"? No unexpected missing values?
No strings in unusual encodings? No additional or missing columns?
2) How long does your job run? What about garbage collector parameters?
Have you checked what happens with jconsole / jvisualvm?
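For the GC side, something along these lines will surface GC pressure in
the executor logs (a sketch: your-app.jar is a placeholder and the
resource numbers are just the ones mentioned earlier in this thread):

    spark-submit \
      --num-executors 16 \
      --executor-memory 2g \
      --executor-cores 2 \
      --driver-memory 2g \
      --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
      your-app.jar

Long GC pauses right before a failure usually point at memory pressure
rather than a bug in the job itself.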
Sincerely yours, Timur
Hi Nick,
Because we use *RandomSignProjectionLSH*, the only parameter for the LSH
is the number of hashes. I tried a small number of hashes (2) but the
error still happens, and it happens when I call the similarity join. After
transformation, the size of the dataset is about 4G.
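To isolate which step fails, the transform can be materialized before the
join, roughly like this (a sketch: model, dataset, and threshold stand in
for the actual objects, and a MinHashLSH-style API is assumed):

    // Force the transform to run on its own first.
    val transformed = model.transform(dataset).cache()
    transformed.count()  // if this succeeds, the transform is fine

    // Any failure below is then in the similarity join itself.
    val pairs = model.approxSimilarityJoin(transformed, transformed, threshold)
    pairs.count()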
2017-02-11 3:07
What other params are you using for the lsh transformer?
Are the issues occurring during transform or during the similarity join?
On Fri, 10 Feb 2017 at 05:46, nguyen duc Tuan wrote:
Hi Das,
In general, I will apply them to larger datasets, so I want to use LSH,
which is more scalable than the approaches you suggested. Have you tried
LSH in Spark 2.1.0 before? If yes, how do you set the
parameters/configuration to make it work?
Thanks.
2017-02-10 19:21 GMT+07:00 Debasish
If it is 7M rows and 700K features (or say 1M features), brute-force row
similarity will run fine as well... check out SPARK-4823... you can
compare quality with the approximate variant...
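For context, SPARK-4823 tracks a rowSimilarities method for RowMatrix;
what already ships in MLlib is columnSimilarities, so a comparison can be
set up like this (a sketch: transposedRows is assumed to be an
RDD[Vector] holding the data transposed, so each original row becomes a
column):

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val mat = new RowMatrix(transposedRows)

    // Exact all-pairs cosine similarity between columns (original rows).
    val exact = mat.columnSimilarities()

    // DIMSUM sampling: cheaper and approximate; similarities below the
    // 0.1 threshold may be missed.
    val approx = mat.columnSimilarities(0.1)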
On Feb 9, 2017 2:55 AM, "nguyen duc Tuan" wrote:
> Hi everyone,
> Since spark 2.1.0 introduces LSH (http://spark.ap