Hello,

1) Are you sure that your data is "clean"? No unexpected missing values?
No strings in unusual encodings? No additional or missing columns?
2) How long does your job run? What about garbage-collector parameters?
Have you checked what happens with jconsole / jvisualvm? One way to get
GC logs is sketched below.
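
A minimal example of enabling GC logging on the executors (these are
standard Spark and HotSpot flags; the class and jar names are
placeholders):

spark-submit \
  --class com.example.LshJob \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  my-lsh-job.jar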

Sincerely yours, Timur

On Sat, Feb 11, 2017 at 12:52 AM, nguyen duc Tuan <newvalu...@gmail.com>
wrote:

> Hi Nick,
> Because we use *RandomSignProjectionLSH*, the only LSH parameter is the
> number of hashes. I tried a small number of hashes (2), but the error
> still happens, and it happens when I call the similarity join. After
> transformation, the size of the dataset is about 4 GB.
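>
> A minimal sketch of the calls, assuming the gist's
> *RandomSignProjectionLSH* is on the classpath and mirrors the standard
> spark.ml LSH API (the column names and the 0.8 distance threshold are
> illustrative):
>
> val lsh = new RandomSignProjectionLSH()
>   .setNumHashTables(2)        // the only tunable parameter
>   .setInputCol("features")
>   .setOutputCol("hashes")
> val model = lsh.fit(dataset)
> val transformed = model.transform(dataset)  // completes; output ~4 GB
> // the OOM occurs in the similarity join:
> val joined = model.approxSimilarityJoin(transformed, transformed, 0.8)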
>
> 2017-02-11 3:07 GMT+07:00 Nick Pentreath <nick.pentre...@gmail.com>:
>
>> What other params are you using for the LSH transformer?
>>
>> Are the issues occurring during transform or during the similarity join?
>>
>>
>> On Fri, 10 Feb 2017 at 05:46, nguyen duc Tuan <newvalu...@gmail.com>
>> wrote:
>>
>>> Hi Das,
>>> In general, I will apply this to larger datasets, so I want to use LSH,
>>> which is more scalable than the approaches you suggested. Have you
>>> tried LSH in Spark 2.1.0 before? If so, how did you set the
>>> parameters/configuration to make it work?
>>> Thanks.
>>>
>>> 2017-02-10 19:21 GMT+07:00 Debasish Das <debasish.da...@gmail.com>:
>>>
>>> If it is 7M rows and 700K features (or, say, 1M features), brute-force
>>> row similarity will run fine as well. Check out SPARK-4823; you can
>>> compare quality with the approximate variant.
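>>>
>>> A rough sketch of the existing brute-force route in MLlib: RowMatrix's
>>> columnSimilarities (DIMSUM) computes column similarities, so row
>>> similarity as discussed in SPARK-4823 would need the transposed matrix.
>>> Here vectorsRDD is a hypothetical RDD[Vector] holding the data:
>>>
>>> import org.apache.spark.mllib.linalg.distributed.RowMatrix
>>>
>>> val mat = new RowMatrix(vectorsRDD)
>>> // the threshold trades precision for speed via DIMSUM sampling
>>> val sims = mat.columnSimilarities(0.1)
>>> sims.entries.take(5).foreach(println)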
>>> On Feb 9, 2017 2:55 AM, "nguyen duc Tuan" <newvalu...@gmail.com> wrote:
>>>
>>> Hi everyone,
>>> Since Spark 2.1.0 introduces LSH (http://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing),
>>> we want to use it to find approximate nearest neighbors. Basically, we
>>> have a dataset with about 7M rows and want to use cosine distance to
>>> measure the similarity between items, so we use *RandomSignProjectionLSH*
>>> (https://gist.github.com/tuan3w/c968e56ea8ef135096eeedb08af097db)
>>> instead of *BucketedRandomProjectionLSH*. I have tried to tune some
>>> configurations such as serialization, memory fraction, executor memory
>>> (~6 GB), number of executors (~20), and memory overhead, but nothing
>>> works. I often get the error "java.lang.OutOfMemoryError: Java heap
>>> space" while running. I know that this implementation was done by an
>>> engineer at Uber, but I don't know the right configuration to run the
>>> algorithm at scale. Does it need very large memory to run? The kind of
>>> submission tried is sketched below.
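>>>
>>> A sketch of the kind of submission tried (values as mentioned above,
>>> the rest illustrative; spark.yarn.executor.memoryOverhead assumes a
>>> YARN deployment; the class and jar names are placeholders):
>>>
>>> spark-submit \
>>>   --class com.example.LshJob \
>>>   --executor-memory 6g \
>>>   --num-executors 20 \
>>>   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>>>   --conf spark.memory.fraction=0.6 \
>>>   --conf spark.yarn.executor.memoryOverhead=2048 \
>>>   my-lsh-job.jar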
>>>
>>> Any help would be appreciated.
>>> Thanks
>>>
>>>
>>>
>
