Hi everyone,
Since Spark 2.1.0 introduced LSH (
http://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing),
we want to use it to find approximate nearest neighbors. We have a dataset
with about 7M rows, and we want to use cosine distance to measure the
similarity between items, so we use *RandomSignProjectionLSH* (
https://gist.github.com/tuan3w/c968e56ea8ef135096eeedb08af097db) instead of
*BucketedRandomProjectionLSH*. I have tried tuning several settings, such as
serialization, memory fraction, executor memory (~6G), number of executors
(~20), and memory overhead, but nothing works. I often get the error
"java.lang.OutOfMemoryError: Java heap space" while running. I know that this
implementation was done by an engineer at Uber, but I don't know the right
configuration to run the algorithm at scale. Does it need a very large amount
of memory?
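
For anyone unfamiliar with the approach: sign random projection hashes a
vector by recording which side of each of several random hyperplanes it falls
on, and the fraction of differing bits between two hashes estimates the angle
between the vectors. Below is a minimal plain-Python sketch of that idea only.
It is not the code from the gist and not the Spark API; the function names
(make_hyperplanes, sign_hash, estimated_cosine) and parameters are made up for
illustration.

```python
import math
import random


def make_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits random Gaussian normal vectors of length dim (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def sign_hash(vec, planes):
    """One bit per hyperplane: 1 if the dot product is non-negative, else 0."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def hamming(a, b):
    """Number of positions where two bit signatures differ."""
    return sum(x != y for x, y in zip(a, b))


def estimated_cosine(sig_a, sig_b):
    """Estimate cosine similarity from two signatures.

    For a random hyperplane, the probability that two vectors land on
    opposite sides is theta / pi, where theta is the angle between them.
    """
    theta = math.pi * hamming(sig_a, sig_b) / len(sig_a)
    return math.cos(theta)


planes = make_hyperplanes(dim=4, num_bits=64, seed=42)
v = [1.0, 2.0, -0.5, 3.0]
scaled = [2.0 * x for x in v]    # same direction, so the signature should match
opposite = [-x for x in v]       # opposite direction, so every bit should flip

print(hamming(sign_hash(v, planes), sign_hash(scaled, planes)))
print(estimated_cosine(sign_hash(v, planes), sign_hash(opposite, planes)))
```

The signature depends only on direction, not magnitude, which is why this
family of hashes suits cosine distance. More bits give a tighter estimate at
the cost of larger signatures per row.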

Any help would be appreciated.
Thanks
