Re: MLlib - Collaborative Filtering - trainImplicit task size

2015-04-27 Thread Xiangrui Meng
Could you try different ranks and see whether the task size changes? We do reference YtY in the task closure, which should behave the same as broadcasting it. If that is the case, it should be safe to ignore this warning. -Xiangrui On Thu, Apr 23, 2015 at 4:52 AM, Christian S. Perone wrote: > All these warnings come from ALS iterations, from flatMap and also from aggregate.
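
A minimal sketch of the distinction Xiangrui is pointing at, with hypothetical names (`yty` stands in for the YtY gram matrix): a value referenced from a closure is serialized into every task, while a broadcast variable ships once per executor. Either way, the bytes that trigger the TaskSetManager warning grow with rank^2, which is why trying different ranks is a useful diagnostic.

    // Assumes a spark-shell session, so `sc` is already defined.
    val rank = 100
    val yty = Array.fill(rank * rank)(1.0)        // stands in for the YtY gram matrix
    val factors = sc.parallelize(Seq.fill(1000)(Array.fill(rank)(0.5)))

    // Referenced from the closure: yty is serialized into every task,
    // so the reported task size grows with rank^2.
    val viaClosure = factors.map(v => v.sum * yty(0))

    // Broadcast: shipped once per executor and read through .value.
    val ytyBc = sc.broadcast(yty)
    val viaBroadcast = factors.map(v => v.sum * ytyBc.value(0))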

Re: MLlib - Collaborative Filtering - trainImplicit task size

2015-04-23 Thread Christian S. Perone
All these warnings come from the ALS iterations, from flatMap and also from aggregate. For instance, the origin of the stage where the flatMap is showing these warnings (with Spark 1.3.0; they are also shown in Spark 1.3.1): org.apache.spark.rdd.RDD.flatMap(RDD.scala:296) org.apache.spark.ml.recommendation.ALS

Re: MLlib - Collaborative Filtering - trainImplicit task size

2015-04-22 Thread Xiangrui Meng
This is the size of the serialized task closure. Is stage 246 part of the ALS iterations, or something before or after them? -Xiangrui On Tue, Apr 21, 2015 at 10:36 AM, Christian S. Perone wrote: > Hi Sean, thanks for the answer. I tried to call repartition() on the input > with many different sizes and it still shows that warning message.

Re: MLlib - Collaborative Filtering - trainImplicit task size

2015-04-21 Thread Christian S. Perone
Hi Sean, thanks for the answer. I tried to call repartition() on the input with many different sizes and it still shows that warning message. On Tue, Apr 21, 2015 at 7:05 AM, Sean Owen wrote: > I think maybe you need more partitions in your input, which might make > for smaller tasks?

Re: MLlib - Collaborative Filtering - trainImplicit task size

2015-04-21 Thread Sean Owen
I think maybe you need more partitions in your input, which might make for smaller tasks? On Tue, Apr 21, 2015 at 2:56 AM, Christian S. Perone wrote: > I keep seeing these warnings when using trainImplicit: > > WARN TaskSetManager: Stage 246 contains a task of very large size (208 KB). > The maximum recommended task size is 100 KB.
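
A minimal sketch of Sean's suggestion, assuming `ratings` is an RDD[Rating] built from the input data; the partition count and ALS parameters are illustrative. Note that repartitioning splits the work into more, smaller tasks, but does not shrink the serialized closure itself.

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // More partitions means the same data is split into more, smaller tasks.
    val repartitioned = ratings.repartition(200)   // 200 is illustrative

    // rank = 10, iterations = 10, lambda = 0.01, alpha = 1.0 (illustrative values)
    val model = ALS.trainImplicit(repartitioned, 10, 10, 0.01, 1.0)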

Re: MLlib -Collaborative Filtering

2015-04-19 Thread Nick Pentreath
You will have to get the two user factor vectors from the ALS model and compute the cosine similarity between them. You can do this using Breeze vectors: import breeze.linalg._ val user1 = new DenseVector[Double](userFactors.lookup("user1").head) val user2 = new DenseVector[Double](userFactors.lookup("user2").head)
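
Completing that snippet, the similarity itself would be the dot product over the product of the norms (a sketch; `dot` and `norm` come with the breeze.linalg._ import above):

    val cosineSim = (user1 dot user2) / (norm(user1) * norm(user2))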

Re: MLlib -Collaborative Filtering

2015-04-19 Thread Christian S. Perone
The easiest way to do that is to use a similarity metric between the different user factors. On Sat, Apr 18, 2015 at 7:49 AM, riginos wrote: > Is there any way that i can see the similarity table of 2 users in that > algorithm? by that i mean the similarity between 2 users

Re: MLlib -Collaborative Filtering

2015-04-18 Thread Nick Pentreath
What do you mean by similarity table of 2 users? Do you mean the similarity between 2 users? On Sat, Apr 18, 2015 at 11:09 AM, riginos wrote: > Is there any way that i can see the similarity table of 2 users in that > algorithm?

Re: MLlib Collaborative Filtering failed to run with rank 1000

2014-10-03 Thread Xiangrui Meng
It would be really helpful if you could help test the scalability of the new ALS impl: https://github.com/mengxr/spark-als/blob/master/src/main/scala/org/apache/spark/ml/SimpleALS.scala . It should be faster and more scalable, but the code is messy now. Best, Xiangrui

Re: MLlib Collaborative Filtering failed to run with rank 1000

2014-10-03 Thread jw.cmu
Thanks, Xiangrui. I didn't check the test error yet. I agree that rank 1000 might overfit for this particular dataset. Currently I'm just running some scalability tests: I'm trying to see how large a model can be scaled to on a fixed amount of hardware.

Re: MLlib Collaborative Filtering failed to run with rank 1000

2014-10-03 Thread Xiangrui Meng
The current impl of ALS constructs least squares subproblems in memory. So for rank 100, the total memory it requires is about 480,189 * 100^2 / 2 * 8 bytes ~ 20GB, divided by the number of blocks. For rank 1000, this number goes up to 2TB, unfortunately. There is a JIRA for optimizing ALS: https:/
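
As a quick sanity check on that arithmetic, here is a back-of-the-envelope helper (an illustration, not Spark API; 480,189 is the factor count from the dataset under discussion, and each subproblem stores roughly half of a rank x rank matrix as 8-byte doubles):

    // Rough total for ALS's in-memory least squares subproblems,
    // before dividing by the number of ALS blocks.
    def alsSubproblemBytes(numFactors: Long, rank: Long): Long =
      numFactors * rank * rank / 2 * 8

    alsSubproblemBytes(480189L, 100L)     // ~1.92e10 bytes  (~20 GB)
    alsSubproblemBytes(480189L, 1000L)    // ~1.92e12 bytes  (~2 TB)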