Hi,

I need help running the ALS matrix factorization algorithm in Spark MLlib.

I am using a 1.5 GB dataset with 480,189 users and 17,770 items, formatted 
the same way as the MovieLens dataset. 
I am trying to run the MovieLensALS example jar on this dataset on an AWS EMR 
Spark cluster with 14 M4.2xlarge slaves. 

Command run: 
/usr/lib/spark/bin/spark-submit --deploy-mode cluster --master yarn \
  --class org.apache.spark.examples.mllib.MovieLensALS \
  --jars /usr/lib/spark/examples/jars/scopt_2.11-3.3.0.jar \
  /usr/lib/spark/examples/jars/spark-examples_2.11-2.0.0.jar \
  --rank 32 --numIterations 50 --kryo s3://dataset/input_dataset

Issues I get:
If I increase rank to 70 or more and numIterations to 15 or more, I get the 
following errors:
1) Stack overflow error 
2) No space left on device (during the shuffle phase)

Could you please let me know whether there are any parameters I should tune to 
make this algorithm work on this dataset?
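
One setting that may be relevant to both errors is checkpointing. The Spark 
API documentation for ALS describes checkpointing as helping with 
StackOverflow exceptions caused by long lineage and with cleaning up temporary 
shuffle files when there are many iterations, and notes that the interval is 
ignored unless a checkpoint directory is set on the SparkContext. As far as I 
can tell, the stock MovieLensALS example does not set one. Below is a minimal 
sketch of a standalone job, not a tested fix for this cluster. The object name 
ALSWithCheckpointing, the two command-line arguments, the "::" field separator, 
and the interval of 5 are all assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Hypothetical driver; the name and argument layout are illustrative only.
object ALSWithCheckpointing {
  def main(args: Array[String]): Unit = {
    val inputPath = args(0)      // e.g. the ratings dataset location
    val checkpointDir = args(1)  // a directory on a fault-tolerant file system

    val sc = new SparkContext(new SparkConf().setAppName("ALSWithCheckpointing"))

    // Without a checkpoint directory, setCheckpointInterval below is ignored.
    sc.setCheckpointDir(checkpointDir)

    // Assumes MovieLens-style lines: userId::itemId::rating::timestamp
    val ratings = sc.textFile(inputPath).map { line =>
      val fields = line.split("::")
      Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
    }.cache()

    val model = new ALS()
      .setRank(70)
      .setIterations(50)
      .setCheckpointInterval(5) // truncate the lineage every 5 iterations (assumed value)
      .run(ratings)

    // Force evaluation of the factor matrices and report their sizes.
    println(s"User factors: ${model.userFeatures.count()}, " +
      s"item factors: ${model.productFeatures.count()}")

    sc.stop()
  }
}
```

The checkpoint directory would typically point at HDFS or another shared 
store, so that every executor can write to it.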

For a better RMSE, I want to increase the number of iterations. Am I missing 
something trivial? Could anyone help me run this algorithm on this dataset 
with more iterations? 

Has anyone been able to run ALS on Spark with more than 100 iterations and a 
rank above 30?

Any help will be greatly appreciated.

Thanks and Regards,
Roshani
