Hi Jay,
I ran into the same problem you describe in Question 1, with a synthetic
dataset, and at first assumed I had generated the dataset incorrectly. It
looks like a bug, so I will open a JIRA for it.
Instead of using:
    ratings.map { case Rating(u, m, r) =>
      val pred = model.predict(u, m)
      (r - pred) * (r - pred)
    }.mean()
you can use something like:
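    import org.apache.spark.SparkContext._  // pair-RDD functions such as join
    import org.apache.spark.mllib.recommendation.Rating
    import org.apache.spark.rdd.RDD

    // data: the RDD[Rating] to score; implicitPrefs: whether the model was trained on implicit feedback
    val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
    val predictionsAndRatings: RDD[(Double, Double)] = predictions.map { x =>
      def mapPredictedRating(r: Double) =
        if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r
      ((x.user, x.product), mapPredictedRating(x.rating))
    }.join(data.map(x => ((x.user, x.product), x.rating))).values
    // This is the RMSE; drop math.sqrt to get the MSE
    math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).mean())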
This workaround worked for me.
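For a dataset as small as movies.csv there is also the route you already found: bring the ratings to the driver and call the single-pair predict there. My guess, going by the PairRDDFunctions.lookup frame in your trace, is that the single-pair predict does an RDD lookup of its own, which cannot run inside another RDD's map. A minimal sketch, assuming the ratings fit in driver memory; mseOnDriver is a made-up helper name, not part of MLlib:

```scala
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

// Hypothetical helper: mean squared error computed on the driver.
// `predict` is any (user, product) => rating function.
// Returns NaN when `ratings` is empty.
def mseOnDriver(ratings: RDD[Rating], predict: (Int, Int) => Double): Double = {
  val local = ratings.collect()  // Array[Rating] held on the driver
  val squaredErrors = local.map { case Rating(u, m, r) =>
    val pred = predict(u, m)     // plain Array.map, so this runs on the driver
    (r - pred) * (r - pred)
  }
  squaredErrors.sum / squaredErrors.length
}
```

With your model this would be called as mseOnDriver(ratings, (u, m) => model.predict(u, m)). It does one lookup per rating, so it is only reasonable for small sets; the join-based version above is the one to use on real data.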
Regarding your Question 2, it would be best to filter the dataset so that
every user and product you predict for also appears in the training data.
If the model was not trained on any data for a user, there is no way to
predict how that user would like a product.
That filtering takes a lot of work though. I can share some code on that too if
you like.
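For reference, a minimal sketch of that kind of filtering, assuming training and test are both RDD[Rating]. filterToTrainedIds is a made-up name, not part of MLlib, and the joins are only one way to do it:

```scala
import org.apache.spark.SparkContext._  // pair-RDD functions such as join
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

// Hypothetical helper: keep only the test ratings whose user AND product
// both occur somewhere in the training set.
def filterToTrainedIds(training: RDD[Rating], test: RDD[Rating]): RDD[Rating] = {
  // Distinct ids seen in training, keyed so they can be joined against
  val trainedUsers = training.map(r => (r.user, ())).distinct()
  val trainedProducts = training.map(r => (r.product, ())).distinct()

  test.map(r => (r.user, r))
    .join(trainedUsers)                          // drops ratings by unseen users
    .map { case (_, (r, _)) => (r.product, r) }
    .join(trainedProducts)                       // drops ratings of unseen products
    .map { case (_, (r, _)) => r }
}
```

You would apply it to each (training, validation) pair that MLUtils.kFold gives you and compute the error only on what is left. The two joins shuffle the test set twice, which is where most of the work goes.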
Best,
Burak
----- Original Message -----
From: "Jay Hutfles" <[email protected]>
To: [email protected]
Sent: Thursday, August 7, 2014 1:06:33 PM
Subject: questions about MLLib recommendation models
I have a few questions regarding a collaborative filtering model, and was
hoping for some recommendations (no pun intended...)
*Setup*
I have a csv file with user/movie/ratings named unimaginatively
'movies.csv'. Here are the contents:
0,0,5
0,1,5
0,2,0
0,3,0
1,0,5
1,3,0
2,1,4
2,2,0
3,0,0
3,1,0
3,2,5
3,3,4
4,0,0
4,1,0
4,2,5
I then load it into an RDD with a nice command like
    val ratings = sc.textFile("movies.csv").map(_.split(',') match {
      case Array(u, m, r) => Rating(u.toInt, m.toInt, r.toDouble)
    })
So far so good. I'm even okay building a model for predicting the absent
values in the matrix with
val rank = 10
val iters = 20
val model = ALS.train(ratings, rank, iters)
I can then use the model to predict any user/movie rating without trouble,
like
model.predict(2, 0)
*Question 1:*
If I were to calculate, say, the mean squared error of the training set (or
to my next question, a test set), this doesn't work:
    ratings.map { case Rating(u, m, r) =>
      val pred = model.predict(u, m)
      (r - pred) * (r - pred)
    }.mean()
Actually, any action on RDDs created by mapping over the RDD[Rating] with a
model prediction fails, like
ratings.map{ case Rating(u, m, _) => model.predict(u, m) }.collect
I get errors due to a "scala.MatchError: null". Here's the exact verbiage:
org.apache.spark.SparkException: Job aborted due to stage failure: Task
26150.0:1 failed 1 times, most recent failure: Exception failure in TID
7091 on host localhost: scala.MatchError: null
org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:571)
org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:43)
$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:18)
$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:18)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
scala.collection.AbstractIterator.to(Iterator.scala:1157)
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
org.apache.spark.scheduler.Task.run(Task.scala:51)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:744)
I think I'm missing something, since I can build up a scala collection of
the exact (user, movie) tuples I'm testing, map over that with the model
prediction, and it works fine. But if I map over the RDD[Rating], it
doesn't. Am I doing something obviously wrong?
*Question 2:*
I have a much larger data set, and instead of running the ALS algorithm on
the whole set, it seems prudent to use the kFold method in
org.apache.spark.mllib.util.MLUtils to generate training/testing splits.
It's rather sparse data, and there are cases where the test set has both
users and movies that are not present in any Ratings in the training set.
When encountering these, the model shouts at me:
java.util.NoSuchElementException: next on empty iterator
Is it the case that the Alternating Least Squares method doesn't create
models which predict values for untrained users/products? My high-level
understanding of the ALS implementation makes it seem understandable that
the calculations depend on at least one rating for each user, and at least
one for each movie. Is that true?
If so, should I simply filter out entries from the test set which have
users or movies absent from the training set? Or is k-fold splitting not an
appropriate way to generate test data for collaborative filtering?
Actually, I should have probably just asked, "What is the best way to do
testing for recommendation models?" Leave it nice and general...
Thanks in advance. Sorry for the long ramble.
Jay