I do a self-join. I tried to cache the transformed dataset before joining,
but it didn't help either.
2017-02-23 13:25 GMT+07:00 Nick Pentreath:
And to be clear, are you doing a self-join for approx similarity? Or
joining to another dataset?
On Thu, 23 Feb 2017 at 02:01, nguyen duc Tuan wrote:
Hi Seth,
Here are the parameters that I used in my experiments:
- Number of executors: 16
- Executor memory: varied from 1G -> 2G -> 3G
- Number of cores per executor: 1 -> 2
- Driver memory: varied from 1G -> 2G -> 3G
- Similarity threshold: 0.6
MinHash:
- Number of hash tables: 2
SignedRandomProjection:
-
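In Spark 2.1 ML terms, the MinHash side of this setup looks roughly like
the following (a sketch: the DataFrame df and the column names are
assumptions, and SignedRandomProjection has no counterpart in Spark ML):

    import org.apache.spark.ml.feature.MinHashLSH

    // df is assumed to have a sparse vector column "features".
    val mh = new MinHashLSH()
      .setNumHashTables(2)      // "Number of hash tables: 2" above
      .setInputCol("features")
      .setOutputCol("hashes")

    val model = mh.fit(df)
    val transformed = model.transform(df).cache()  // cache before self-join

    // Self-join at the 0.6 threshold; output carries a "distCol" column.
    val pairs = model.approxSimilarityJoin(transformed, transformed, 0.6)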
I'm looking into this a bit further, thanks for bringing it up! Right now
the LSH implementation only uses OR-amplification. The practical
consequence of this is that it will select too many candidates when doing
approximate near neighbor search and approximate similarity join. When we
add AND-amplification
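To make the distinction concrete, here is a toy sketch (not the Spark
internals): with several hash tables, OR-amplification treats a pair as
candidates if their hashes collide in at least one table, while
AND-amplification requires a collision in every table:

    // hashesA(i) and hashesB(i) are the hash values of two points in table i.
    def orCandidate(hashesA: Seq[Int], hashesB: Seq[Int]): Boolean =
      hashesA.zip(hashesB).exists { case (a, b) => a == b }  // any table collides

    def andCandidate(hashesA: Seq[Int], hashesB: Seq[Int]): Boolean =
      hashesA.zip(hashesB).forall { case (a, b) => a == b }  // all tables collide

OR-amplification raises the chance of catching true neighbors but admits
many false candidates, which is why the join output blows up;
AND-amplification prunes far more aggressively.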
The original Uber authors provided this performance test result:
https://docs.google.com/document/d/19BXg-67U83NVB3M0I84HVBVg3baAVaESD_mrg_-vLro
This was for MinHash only though, so it's not clear what the scalability
is like for the other metric types.
The SignRandomProjectionLSH is not yet in Spark.
In the end, I switched back to the LSH implementation that I used before
(https://github.com/karlhigley/spark-neighbors). I can run it on my dataset
now. If anyone has a suggestion, please tell me.
Thanks.
2017-02-12 9:25 GMT+07:00 nguyen duc Tuan:
> Hi Timur,
> 1) Our data is transformed to datase
Hello,
1) Are you sure that your data is "clean"? No unexpected missing values?
No strings in unusual encodings? No additional or missing columns?
2) How long does your job run? What about garbage collector parameters?
Have you checked what happens with jconsole / jvisualvm?
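For the GC side, something along these lines will surface GC pressure in
the executor logs (a sketch: your-app.jar is a placeholder and the
resource numbers are just the ones mentioned earlier in this thread):

    spark-submit \
      --num-executors 16 \
      --executor-memory 2g \
      --executor-cores 2 \
      --driver-memory 2g \
      --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
      your-app.jar

Long GC pauses right before a failure usually point at memory pressure
rather than a bug in the job itself.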
Sincerely yours, Timur
Hi Nick,
Because we use *RandomSignProjectionLSH*, the only parameter for the LSH
is the number of hashes. I tried a small number of hashes (2) but the
error still happens, and it happens when I call the similarity join. After
transformation, the size of the dataset is about 4G.
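To isolate which step fails, the transform can be materialized before the
join, roughly like this (a sketch: model, dataset, and threshold stand in
for the actual objects, and a MinHashLSH-style API is assumed):

    // Force the transform to run on its own first.
    val transformed = model.transform(dataset).cache()
    transformed.count()  // if this succeeds, the transform is fine

    // Any failure below is then in the similarity join itself.
    val pairs = model.approxSimilarityJoin(transformed, transformed, threshold)
    pairs.count()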
2017-02-11 3:07
What other params are you using for the lsh transformer?
Are the issues occurring during transform or during the similarity join?
On Fri, 10 Feb 2017 at 05:46, nguyen duc Tuan wrote:
Hi Das,
In general, I will apply them to larger datasets, so I want to use LSH,
which is more scalable than the approaches you suggested. Have you tried
LSH in Spark 2.1.0 before? If yes, how do you set the
parameters/configuration to make it work?
Thanks.
2017-02-10 19:21 GMT+07:00 Debasish
If it is 7M rows and 700K features (or say 1M features), brute-force row
similarity will run fine as well... check out SPARK-4823... you can
compare quality with the approximate variant...
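For context, SPARK-4823 tracks a rowSimilarities method for RowMatrix;
what already ships in MLlib is columnSimilarities, so a comparison can be
set up like this (a sketch: transposedRows is assumed to be an
RDD[Vector] holding the data transposed, so each original row becomes a
column):

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val mat = new RowMatrix(transposedRows)

    // Exact all-pairs cosine similarity between columns (original rows).
    val exact = mat.columnSimilarities()

    // DIMSUM sampling: cheaper and approximate; similarities below the
    // 0.1 threshold may be missed.
    val approx = mat.columnSimilarities(0.1)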
On Feb 9, 2017 2:55 AM, "nguyen duc Tuan" wrote:
> Hi everyone,
> Since spark 2.1.0 introduces LSH (http://spark.ap