Re: questions about MLLib recommendation models

Xiangrui Meng Thu, 07 Aug 2014 19:15:07 -0700

ratings.map{ case Rating(u,m,r) => {
    val pred = model.predict(u, m)
    (r - pred)*(r - pred)
  }
}.mean()


The code doesn't work because the userFeatures and productFeatures
stored in the model are RDDs. You tried to serialize them into the
task closure, and execute `model.predict` on an executor, which won't
work because `model.predict` can only be called on the driver. We
should make this clear in the doc. You should use what Burak
suggested:

val predictions = model.predict(data.map(x => (x.user, x.product)))

Best,
Xiangrui

On Thu, Aug 7, 2014 at 1:20 PM, Burak Yavuz <[email protected]> wrote:
> Hi Jay,
>
> I've had the same problem you've been having in Question 1 with a synthetic 
> dataset. I thought I wasn't producing the dataset well enough. This seems to
> be a bug. I will open a JIRA for it.
>
> Instead of using:
>
> ratings.map{ case Rating(u,m,r) => {
>     val pred = model.predict(u, m)
>     (r - pred)*(r - pred)
>   }
> }.mean()
>
> you can use something like:
>
> val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, 
> x.product)))
> val predictionsAndRatings: RDD[(Double, Double)] = predictions.map{ x =>
>   def mapPredictedRating(r: Double) = if (implicitPrefs) math.max(math.min(r, 
> 1.0), 0.0) else r
>   ((x.user, x.product), mapPredictedRating(x.rating))
> }.join(data.map(x => ((x.user, x.product), x.rating))).values
>
> math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - 
> x._2)).mean())
>
> This work around worked for me.
>
> Regarding your question 2, it will be best of you do a special filtering of 
> the dataset so that you do train for that user and product.
> If we don't have any data trained on a user, there is no way to predict how 
> he would like a product.
> That filtering takes a lot of work though. I can share some code on that too 
> if you like.
>
> Best,
> Burak
>
> ----- Original Message -----
> From: "Jay Hutfles" <[email protected]>
> To: [email protected]
> Sent: Thursday, August 7, 2014 1:06:33 PM
> Subject: questions about MLLib recommendation models
>
> I have a few questions regarding a collaborative filtering model, and was
> hoping for some recommendations (no pun intended...)
>
> *Setup*
>
> I have a csv file with user/movie/ratings named unimaginatively
> 'movies.csv'.  Here are the contents:
>
> 0,0,5
> 0,1,5
> 0,2,0
> 0,3,0
> 1,0,5
> 1,3,0
> 2,1,4
> 2,2,0
> 3,0,0
> 3,1,0
> 3,2,5
> 3,3,4
> 4,0,0
> 4,1,0
> 4,2,5
>
> I then load it into an RDD with a nice command like
>
> val ratings = sc.textFile("movies.csv").map(_.split(',') match { case
> Array(u,m,r) => (Rating(u.toInt, m.toInt, r.toDouble))})
>
> So far so good.  I'm even okay building a model for predicting the absent
> values in the matrix with
>
> val rank = 10
> val iters = 20
> val model = ALS.train(ratings, rank, iters)
>
> I can then use the model to predict any user/movie rating without trouble,
> like
>
> model.predict(2, 0)
>
> *Question 1: *
>
> If I were to calculate, say, the mean squared error of the training set (or
> to my next question, a test set), this doesn't work:
>
> ratings.map{ case Rating(u,m,r) => {
>     val pred = model.predict(u, m)
>     (r - pred)*(r - pred)
>   }
> }.mean()
>
> Actually, any action on RDDs created by mapping over the RDD[Rating] with a
> model prediction  fails, like
>
> ratings.map{ case Rating(u, m, _) => model.predict(u, m) }.collect
>
> I get errors due to a "scala.MatchError: null".  Here's the exact verbiage:
>
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 26150.0:1 failed 1 times, most recent failure: Exception failure in TID
> 7091 on host localhost: scala.MatchError: null
>
> org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:571)
>
> org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:43)
>         $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:18)
>         $iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:18)
>         scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>         scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>         scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>         scala.collection.AbstractIterator.to(Iterator.scala:1157)
>
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>         scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>         scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>         org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
>         org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
>
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
>
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
>         org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>         org.apache.spark.scheduler.Task.run(Task.scala:51)
>
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         java.lang.Thread.run(Thread.java:744)
>
> I think I'm missing something, since I can build up a scala collection of
> the exact (user, movie) tuples I'm testing, map over that with the model
> prediction, and it works fine.  But if I map over the RDD[Rating], it
> doesn't.  Am I doing something obviously wrong?
>
> *Question 2:*
>
> I have a much larger data set, and instead of running the ALS algorithm on
> the whole set, it seems prudent to use the kFolds method in
> org.apache.spark.mllib.util.MLUtils to generate training/testing splits.
>
> It's rather sparse data, and there are cases where the test set has both
> users and movies that are not present in any Ratings in the training set.
>  When encountering these, the model shouts at me:
>
> java.util.NoSuchElementException: next on empty iterator
>
> Is it the case that the Alternating Least Squares method doesn't create
> models which predict values for untrained users/products?  My high-level
> understanding of the ALS implementation makes it seem understandable that
> the calculations depend on at least one rating for each user, and at least
> one for each movie.  Is that true?
>
> If so, should I simply filter out entries from the test set which have
> users or movies absent from the training set?  Or is kMeans not an
> appropriate way to generate test data for collaborative filtering?
>
> Actually, I should have probably just asked, "What is the best way to do
> testing for recommendation models?"  Leave it nice and general...
>
> Thanks in advance.  Sorry for the long ramble.
>    Jay
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: questions about MLLib recommendation models

Reply via email to