Hi everyone,

Since Spark 2.1.0 introduced LSH (http://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing), we want to use it to find approximate nearest neighbors. We have a dataset with about 7M rows, and we want to use cosine distance to measure the similarity between items, so we use *RandomSignProjectionLSH* (https://gist.github.com/tuan3w/c968e56ea8ef135096eeedb08af097db) instead of *BucketedRandomProjectionLSH*.

I have tried tuning several settings, such as serialization, memory fraction, executor memory (~6G), number of executors (~20), and memory overhead, but nothing works. I often get the error "java.lang.OutOfMemoryError: Java heap space" while running.

I know that this implementation was written by an engineer at Uber, but I don't know the right configuration to run the algorithm at scale. Do they need a very large amount of memory to run it?

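For context, this is roughly how I understand the sign-random-projection hashing that cosine LSH relies on. It is a minimal pure-Python sketch, not the code from the gist and not Spark's API; the names make_hyperplanes, sign_hash, hamming and estimated_cosine are made up for illustration, and the vectors are toy data.

```python
import math
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplane normals of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def sign_hash(vector, hyperplanes):
    """One bit per hyperplane: 1 if the dot product is non-negative, else 0."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(a, b):
    """Number of positions where two signatures differ."""
    return sum(x != y for x, y in zip(a, b))


def estimated_cosine(a, b):
    """Estimate cosine similarity from the fraction of differing bits.

    For random hyperplanes, P(bit differs) = angle / pi, so the angle is
    approximated by pi * hamming / num_bits.
    """
    return math.cos(math.pi * hamming(a, b) / len(a))


# Toy usage: two similar vectors and one pointing the opposite way.
planes = make_hyperplanes(num_bits=64, dim=3, seed=42)
u = [1.0, 2.0, 3.0]
v = [1.1, 1.9, 3.2]
w = [-1.0, -2.0, -3.0]
print(estimated_cosine(sign_hash(u, planes), sign_hash(v, planes)))
print(estimated_cosine(sign_hash(u, planes), sign_hash(w, planes)))
```

The signature depends only on the direction of a vector, not its length, which is why this family suits cosine distance rather than Euclidean distance.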
Any help would be appreciated. Thanks