One thing worth mentioning is that the schema is actually (userId, itemId,
nbPurchase), where nbPurchase plays the role of the rating. I found that
there are many one-timers, i.e. pairs with nbPurchase = 1. These pairs make
up about 85% of all positive observations.
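
For reference, the share of one-timers can be checked with a count like the
sketch below. It runs on a plain Scala collection with made-up toy data
rather than the real RDD; the `Purchase` case class and `oneTimerShare` are
names invented for illustration.

```scala
// Hypothetical record type mirroring the (userId, itemId, nbPurchase) schema.
final case class Purchase(userId: Int, itemId: Int, nbPurchase: Int)

// Fraction of positive observations (nbPurchase > 0) with nbPurchase == 1.
def oneTimerShare(purchases: Seq[Purchase]): Double = {
  val positives = purchases.filter(_.nbPurchase > 0)
  if (positives.isEmpty) 0.0
  else positives.count(_.nbPurchase == 1).toDouble / positives.size
}

// Toy data, not taken from the real data set.
val sample = Seq(
  Purchase(1, 10, 1),
  Purchase(1, 11, 1),
  Purchase(2, 10, 3),
  Purchase(3, 12, 1)
)
val share = oneTimerShare(sample) // 3 of the 4 pairs are one-timers
```
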
As the paper says, low ratings get a low confidence weight, so if I
understand correctly, these dominant one-timers will be *less likely* to
be recommended compared to other items whose nbPurchase is bigger.
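
To make the weighting concrete, here is a minimal sketch that assumes the
linear confidence from the implicit-feedback paper, c = 1 + alpha * r. The
function names are illustrative only. Under this assumed form every pair
with nbPurchase > 0 has the same binary preference of 1, and only the weight
of that observation in the loss differs; an unobserved pair gets weight 1.

```scala
// Linear confidence weight, c = 1 + alpha * r (assumed form, per the paper).
def confidence(nbPurchase: Int, alpha: Double): Double =
  1.0 + alpha * nbPurchase

// Binary preference: 1 if the pair was observed at all, 0 otherwise.
def preference(nbPurchase: Int): Double =
  if (nbPurchase > 0) 1.0 else 0.0

val alphaUsed  = 1.0
val unobserved = confidence(0, alphaUsed)    // pair never purchased
val oneTimer   = confidence(1, alphaUsed)    // a one-timer
val maxBuyer   = confidence(1396, alphaUsed) // largest nbPurchase in the data
```

So with alpha = 1 a one-timer weighs only twice as much as an unobserved
pair, while the heaviest pair weighs 1397 times as much.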
Lambda may also be a problem: in our case lambda is set to 300, a value
that was chosen based on the test set. Here are the test results:
lambda = 65
EPR_in = 0.06518592593142056
EPR_out = 0.14789338884259276
lambda = 100
EPR_in = 0.06619274171311466
EPR_out = 0.13494609978226865
lambda = 300
EPR_in = 0.08814703345418627
EPR_out = 0.09522125434156471
where EPR_in is computed on the training set and EPR_out on the test set.
300 seems to be the right lambda: it gives the lowest EPR_out and the
smallest gap between EPR_in and EPR_out, i.e. the least overfitting.
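
The comparison can be spelled out as a small sketch. The numbers are the
ones reported above; the names `results`, `gaps` and `bestLambda` are
illustrative, and "gap" here simply means EPR_out - EPR_in.

```scala
// (lambda, EPR_in, EPR_out) as reported in the test results.
val results = Seq(
  (65,  0.06518592593142056, 0.14789338884259276),
  (100, 0.06619274171311466, 0.13494609978226865),
  (300, 0.08814703345418627, 0.09522125434156471)
)

// Gap per lambda: EPR_out - EPR_in (a smaller gap suggests less overfitting).
val gaps = results.map { case (lambda, eprIn, eprOut) =>
  lambda -> (eprOut - eprIn)
}

// Lambda with the lowest test-set EPR among the three values tried.
val bestLambda = results.minBy(_._3)._1

gaps.foreach { case (lambda, gap) => println(f"lambda = $lambda%3d  gap = $gap%.4f") }
```
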
The other parameters are shown in the following code:
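import org.apache.spark.mllib.recommendation.ALS

// Implicit-feedback ALS; ratings_train is an RDD[Rating] built from
// (userId, itemId, nbPurchase).
val model = new ALS()
  .setImplicitPrefs(true)
  .setAlpha(1.0)      // confidence scaling
  .setLambda(300.0)   // regularization
  .setRank(50)        // number of latent factors
  .setIterations(40)
  .setBlocks(8)
  .setSeed(42L)
  .run(ratings_train)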
We set alpha to 1 because the maximum nbPurchase is 1396. I am not sure
whether alpha is already too big.
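
One way to see what alpha = 1 does at the extremes is to compare the linear
confidence with the log-scaled variant the paper also describes,
c = 1 + alpha * log(1 + r / epsilon). This is only a sketch under
assumptions: as far as I know MLlib applies the linear form to whatever
rating it is given, so the log variant would mean transforming nbPurchase
before training. `eps = 1.0` and the helper names are illustrative choices,
not tuned values.

```scala
// Log-damped rating, r' = log(1 + r / epsilon); epsilon is a hypothetical choice.
def dampedRating(nbPurchase: Int, epsilon: Double): Double =
  math.log(1.0 + nbPurchase / epsilon)

// Linear confidence applied to a (possibly pre-transformed) rating.
def linearConfidence(rating: Double, a: Double): Double = 1.0 + a * rating

val eps       = 1.0
val rawMax    = linearConfidence(1396.0, 1.0)                  // 1397.0
val dampedMax = linearConfidence(dampedRating(1396, eps), 1.0) // 1 + ln(1397)
val dampedOne = linearConfidence(dampedRating(1, eps), 1.0)    // 1 + ln(2)
```

With the linear form the heaviest pair dominates a one-timer by a factor of
several hundred, whereas the damped form keeps both within the same order of
magnitude.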
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/implicit-ALS-dataSet-tp7067p7916.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.