Hi Wei,

The Sparkler code was not available for benchmarking, so I picked up Jellyfish, which uses SGD. If you look at the paper, the ideas are very similar to the Sparkler paper, but Jellyfish runs on shared memory in C while Sparkler was built on top of Spark. Jellyfish also uses some interesting regularization, like L1-ball projection, and in fact the paper reported the best RMSE on the Netflix dataset using these ideas.
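In case you have not read the Jellyfish paper, the per-rating math is just a plain SGD step followed by projecting the updated factor rows back onto an L1 ball. Here is a minimal Scala sketch of that idea; it is my own illustration with hypothetical names (projectL1, sgdStep), not Jellyfish's actual C code, and the projection is the standard sort-based one from Duchi et al. (2008):

object L1ProjectedSgdSketch {

  // Project v onto the L1 ball {x : ||x||_1 <= z}, assuming z > 0
  // (sort-based method of Duchi et al. 2008).
  def projectL1(v: Array[Double], z: Double): Array[Double] = {
    val absSum = v.map(math.abs).sum
    if (absSum <= z) {
      v.clone()
    } else {
      val u = v.map(math.abs).sortWith(_ > _)   // magnitudes, descending
      val cum = u.scanLeft(0.0)(_ + _).tail     // cumulative sums of u
      val rho = (0 until u.length).filter(j => u(j) - (cum(j) - z) / (j + 1) > 0).max
      val theta = (cum(rho) - z) / (rho + 1)
      v.map(x => math.signum(x) * math.max(math.abs(x) - theta, 0.0))
    }
  }

  // One SGD step on a single rating, then project both factor rows onto the L1 ball.
  def sgdStep(userF: Array[Double], itemF: Array[Double], rating: Double,
              step: Double, l1Radius: Double): (Array[Double], Array[Double]) = {
    val err = rating - userF.zip(itemF).map { case (a, b) => a * b }.sum
    val newUser = Array.tabulate(userF.length)(k => userF(k) + step * err * itemF(k))
    val newItem = Array.tabulate(itemF.length)(k => itemF(k) + step * err * userF(k))
    (projectL1(newUser, l1Radius), projectL1(newItem, l1Radius))
  }
}

As I recall, Jellyfish's real contribution is how it partitions the ratings so that many such updates can run in parallel on shared memory without conflicts; the sketch above only shows the per-rating update.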
Jellyfish was a good runtime baseline to compare against. Of course, the issue with shared memory is scalability.

Sparkler was written before the ALS version we have right now in MLlib existed. The blocking idea used in Sparkler is not very different from what we have in ALS right now, but Sparkler used a faster checkpointing method (they called it a Carousel Map) which is equally applicable to MLlib ALS. Right now we checkpoint to HDFS; an efficient hot distributed cache (like memcached, Redis, or Cassandra) would improve the runtime further but would bring additional dependencies. At the moment MLlib assumes HDFS and Tachyon as the checkpoint solutions.

In my experiments I found distributed Spark ALS on 16 cores (8 nodes, each running 2 cores) to be around 2x slower than Jellyfish SGD, also running on 16 cores in shared memory. The biggest machine we had was 16 cores. Note that we factored about 20M users and 3M products with ~400M ratings, so 2x was a good number for me. I also found Spark ALS to be 6-7x faster than Oryx (the scalable version of Mahout ALS).

In terms of RMSE, SGD produced results very similar to ALS in my experiments with the Netflix dataset. On new datasets I still run Jellyfish to check whether SGD finds a better local minimum than ALS; if it does on some dataset, that is a motivation to implement DSGD/Sparkler. I have cleaned up some Netflix-specific optimizations from Jellyfish, and I also implemented a non-negative variant (a projected-gradient algorithm similar to the one used in NNLS.scala), but with SGD, implementing the projection and getting good convergence turned out to be difficult. If you want, I can open source the code.

Spark ALS (and Oryx/Mahout's Taste) represents the problem as least squares. For explicit feedback no broadcast is needed, and for implicit feedback a global Gram matrix is computed with a rank x rank broadcast; if the ranks are low, that is not a big issue. Representing the problem as least squares has other benefits as well, since we can map least squares to a quadratic program and use the following ideas: https://issues.apache.org/jira/browse/SPARK-2426 (a rough sketch of the per-user least-squares solve is at the very end of this mail).

We are doing a more thorough comparison of our Spark ALS variants against Jellyfish on shared memory; I will point you to the results when they are available. I also feel it might be worthwhile to represent the factorization problem on GraphX, but that is a different direction altogether from the blocked matrix factorization ideas.

Thanks.
Deb

On Sun, Aug 17, 2014 at 12:14 PM, Wei Tan <w...@us.ibm.com> wrote:
> Hi Xiangrui, yes I was not clear in my previous posting. You did the
> optimization at the block level (which is brilliant!) so that blocks are
> joined first to avoid redundant data transfer.
>
> I have two follow-up questions:
>
> 1. when you do rdd_a.join(rdd_b), at which site will this join be done?
> Say, if sizeOf(rdd_a) >> sizeOf(rdd_b), does Spark move rdd_b to rdd_a (in a
> per-block manner) and do the join there?
> 2. for matrix factorization, there exist some distributed SGD algorithms
> such as:
> http://people.cs.umass.edu/~boduo/publications/2013EDBT-sparkler.pdf
> I plan to do some performance comparison soon. Any idea on which method
> is better?
>
> Thanks!
>
> Wei
>
> ---------------------------------
> Wei Tan, PhD
> Research Staff Member
> IBM T. J. Watson Research Center
> http://researcher.ibm.com/person/us-wtan
>
>
> From: Xiangrui Meng <men...@gmail.com>
> To: Wei Tan/Watson/IBM@IBMUS
> Cc: "user@spark.apache.org" <user@spark.apache.org>
> Date: 08/04/2014 12:51 AM
> Subject: Re: MLLib: implementing ALS with distributed matrix
> ------------------------------
>
> To be precise, the optimization is not `get all products that are
> related to this user` but `get all products that are related to users
> inside this block`. So a product factor won't be sent to the same
> block more than once. We considered using GraphX to implement ALS,
> which is much easier to understand. Right now, GraphX doesn't support
> the optimization we use in ALS. But we are definitely looking for
> simplification of MLlib's ALS code. If you have some good ideas,
> please create a JIRA and we can move our discussion there.
>
> Best,
> Xiangrui
>
> On Sun, Aug 3, 2014 at 8:39 PM, Wei Tan <w...@us.ibm.com> wrote:
> > Hi,
> >
> > I wrote my centralized ALS implementation and read the distributed
> > implementation in MLlib. It uses InLink and OutLink to implement functions
> > like "get all products which are related to this user", and ultimately
> > achieves model distribution.
> >
> > If we had a distributed matrix library, the complex InLink and OutLink
> > logic could be achieved relatively easily with matrix select-row or
> > select-column operators. With the InLink- and OutLink-based implementation,
> > the distributed code is quite different from, and more complex than, the
> > centralized one.
> >
> > I have a question: could we move this complexity (InLink and OutLink) to a
> > lower distributed matrix manipulation layer, leaving the upper-layer ALS
> > algorithm "similar" to a centralized one? To be more specific, if we can
> > make a DoubleMatrix an RDD and optimize its distributed manipulation, we
> > can make the ALS algorithm easier to implement.
> >
> > Does it make any sense?
> >
> > Best regards,
> > Wei
> >
> > ---------------------------------
> > Wei Tan, PhD
> > Research Staff Member
> > IBM T. J. Watson Research Center
> > http://researcher.ibm.com/person/us-wtan
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
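PS (for anyone following the least-squares discussion above): here is a rough, self-contained Scala sketch of the per-user subproblem, in my own notation with hypothetical names (normalEquations, projectedGradient); it is not the MLlib code. Once the factors of the items a user rated have been gathered (which is what the InLink/OutLink blocking achieves), ALS builds a rank x rank Gram system and solves it; a non-negative variant can be obtained with a simple projected-gradient loop in the spirit of, but much simpler than, NNLS.scala.

object AlsSubproblemSketch {

  // Build the normal equations A = Y^T Y + lambda * I and b = Y^T r for one user,
  // where the rows of Y are the factors of the items this user rated.
  def normalEquations(itemFactors: Array[Array[Double]], ratings: Array[Double],
                      lambda: Double): (Array[Array[Double]], Array[Double]) = {
    val rank = itemFactors(0).length
    val ata = Array.fill(rank, rank)(0.0)
    val atb = Array.fill(rank)(0.0)
    for ((y, r) <- itemFactors.zip(ratings)) {
      for (i <- 0 until rank; j <- 0 until rank) ata(i)(j) += y(i) * y(j)
      for (i <- 0 until rank) atb(i) += y(i) * r
    }
    for (i <- 0 until rank) ata(i)(i) += lambda   // plain L2 regularization
    (ata, atb)
  }

  // Solve min_{x >= 0} 0.5 x^T A x - b^T x by plain projected gradient:
  // x <- max(0, x - alpha * (A x - b)), with a fixed step size alpha.
  // This is a sketch only; NNLS.scala uses a more careful method.
  def projectedGradient(ata: Array[Array[Double]], atb: Array[Double],
                        alpha: Double, iters: Int): Array[Double] = {
    val rank = atb.length
    var x = Array.fill(rank)(0.0)
    for (_ <- 0 until iters) {
      val grad = Array.tabulate(rank)(i =>
        (0 until rank).map(j => ata(i)(j) * x(j)).sum - atb(i))
      x = Array.tabulate(rank)(i => math.max(0.0, x(i) - alpha * grad(i)))
    }
    x
  }
}

Because the whole subproblem lives in a rank x rank system, only factor vectors have to move between blocks, which is why low ranks keep the joins and broadcasts cheap.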