Practical configuration to run LSH in Spark 2.1.0

nguyen duc Tuan Wed, 08 Feb 2017 23:55:55 -0800

Hi everyone,
Since Spark 2.1.0 introduces LSH (
http://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing),
we want to use it to find approximate nearest neighbors. We have a dataset
of about 7M rows and want to use cosine distance to measure the similarity
between items, so we use *RandomSignProjectionLSH* (
https://gist.github.com/tuan3w/c968e56ea8ef135096eeedb08af097db) instead of
*BucketedRandomProjectionLSH*. I have tried tuning several settings, such as
serialization, memory fraction, executor memory (~6G), number of executors
(~20), memory overhead, ..., but nothing works. I often get the error
"java.lang.OutOfMemoryError: Java heap space" while running. I know that
this implementation was done by an engineer at Uber, but I don't know the
right configuration to run the algorithm at scale. Do they need a very large
amount of memory to run it?
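
For context, the sign-random-projection idea behind a cosine-distance LSH
can be sketched in a few lines of plain Python. This is only an
illustration of the general technique: it is not the code in the gist
above and not Spark's API, and all function names here are made up for the
example.

```python
import math
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplane normals of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def sign_hash(vec, hyperplanes):
    """One bit per hyperplane: 1 if the dot product is non-negative, else 0."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(a_bits, b_bits):
    """Number of positions where the two signatures differ."""
    return sum(x != y for x, y in zip(a_bits, b_bits))


def estimated_cosine(a_bits, b_bits):
    """Estimate cosine similarity from the fraction of differing bits.

    Two vectors at angle theta disagree on a random hyperplane with
    probability theta / pi, so theta is roughly pi * hamming / num_bits.
    """
    theta = math.pi * hamming(a_bits, b_bits) / len(a_bits)
    return math.cos(theta)
```

The signature depends only on the direction of a vector, not its length,
which is why this family of hashes suits cosine distance.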


Any help would be appreciated.
Thanks
