I am running my own PySpark application (solving matrix factorization using
Gemulla's DSGD algorithm). The program seemed to work fine on the smaller
MovieLens dataset but failed on the larger Netflix data. It took about 14 hours
to complete two iterations and lost an executor (I used a total of 8 executors
al
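For reference, the core per-rating update that DSGD applies inside each stratum block looks roughly like the following (a minimal NumPy sketch with illustrative names, not code taken from my application):

```python
import numpy as np

def dsgd_step(w_i, h_j, r_ij, step, reg):
    """One SGD update on a single observed rating r_ij, as used inside a DSGD block.

    w_i : user factor vector (length = rank)
    h_j : item factor vector (length = rank)
    """
    err = r_ij - w_i.dot(h_j)                  # prediction error for this rating
    w_new = w_i + step * (err * h_j - reg * w_i)
    h_new = h_j + step * (err * w_i - reg * h_j)
    return w_new, h_new

# Example call with random rank-20 factors and an observed rating of 4.0
w, h = dsgd_step(np.random.rand(20), np.random.rand(20), 4.0, step=0.01, reg=0.05)
```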
Hi Peter,
Thanks for your response. Would you mind sharing a few more details about
your dataset?
Thanks, Xiangrui.
I didn't check the test error yet. I agree that rank 1000 might overfit for
this particular dataset. Currently I'm just running some scalability tests;
I'm trying to see how far the model can be scaled given a fixed amount
of hardware.
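For a rough sense of scale, the raw factor matrices at rank 1000 are already a few gigabytes (the user/item counts below are the approximate Netflix Prize figures, not something I measured from this run):

```python
# Back-of-the-envelope size of the two factor matrices at rank 1000,
# assuming 8-byte doubles and approximate Netflix Prize dimensions.
users, items, rank = 480189, 17770, 1000
model_gb = (users + items) * rank * 8 / 1e9
print(model_gb)  # roughly 4 GB of raw factors, before any JVM/object overhead
```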
I was able to run collaborative filtering with low rank values, like 20-160,
on the Netflix dataset, but it fails with the following error when I set
the rank to 1000:
14/10/03 03:27:36 WARN TaskSetManager: Loss was due to
java.lang.IllegalArgumentException
java.lang.IllegalArgumentException: Si
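For reference, the shape of the call is just the standard MLlib ALS invocation with a larger rank; a minimal PySpark sketch (the input path, parsing, and parameter values are placeholders, not my exact code):

```python
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="ALSRankTest")

# Netflix-style input: userId,movieId,rating per line (path is a placeholder)
ratings = (sc.textFile("hdfs:///data/netflix/ratings.csv")
             .map(lambda line: line.split(","))
             .map(lambda f: Rating(int(f[0]), int(f[1]), float(f[2]))))

# Works with rank in the 20-160 range; fails with IllegalArgumentException at rank=1000
model = ALS.train(ratings, rank=1000, iterations=10, lambda_=0.01)
```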
I'm trying to run Spark ALS on the Netflix dataset, but it failed with a "No
space on device" exception. The exception seems to be thrown after the
training phase; it's not clear to me what is being written and where the
output directory is.
I was able to run the same code on the provided test.data.
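For what it's worth, Spark writes its shuffle and spill files under spark.local.dir (which defaults to /tmp), so that volume filling up is one likely source of this error; below is a minimal sketch of pointing it at a larger disk (the path is a placeholder, and on YARN the node manager's local dirs take precedence over this setting):

```python
from pyspark import SparkConf, SparkContext

# Redirect Spark's scratch space (shuffle/spill files) to a larger volume.
# "/mnt/bigdisk/spark-tmp" is a placeholder path.
conf = (SparkConf()
        .setAppName("NetflixALS")
        .set("spark.local.dir", "/mnt/bigdisk/spark-tmp"))
sc = SparkContext(conf=conf)
```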