Hi Deb, thanks for sharing your results. Please find my comments inline in blue.

Best regards,
Wei
From: Debasish Das <debasish.da...@gmail.com>
To: Wei Tan/Watson/IBM@IBMUS
Cc: Xiangrui Meng <men...@gmail.com>, "user@spark.apache.org" <user@spark.apache.org>
Date: 08/17/2014 08:15 PM
Subject: Re: MLLib: implementing ALS with distributed matrix

Hi Wei,

Sparkler code was not available for benchmarking, so I picked up Jellyfish, which uses SGD. If you look at the paper, the ideas are very similar to the Sparkler paper, but Jellyfish runs on shared memory and uses C code, while Sparkler was built on top of Spark. Jellyfish uses some interesting regularization, like L1-ball projection, and in fact the paper produced the best RMSE on the Netflix dataset using these ideas. Jellyfish was a good runtime baseline to compare against. Of course, the issue with shared memory is scalability.

I know the authors of Sparkler and I may be able to get the code. The issue is that it needs a change to Spark by adding the CM, which makes the approach less appealing. For performance numbers, it seems that their run-time on a 40-machine cluster is similar to what I got using MLlib ALS on a single 32-core machine. But this comparison is very rough and not at all scientific -- just for your information. For a very brief back-of-envelope comparison (and correct me if I am wrong), it seems that Sparkler does not need to shape the rating matrix R in two ways, R and R'. It also involves less data movement in each epoch, since each product block only needs to move once (as a whole) to one destination user block, whereas in ALS each product block needs to be split and moved to many user blocks. So I wonder why it does not provide better performance than ALS.

Sparkler was written when the ALS version we have right now in MLlib did not exist. The blocking idea used in Sparkler is not very different from what we have in ALS now. But Sparkler used a faster checkpointing method (they call it a Carousel Map) which is equally applicable to MLlib ALS. Right now we checkpoint to HDFS, but an efficient hot distributed cache (like memcached, redis or cassandra) would help improve the runtime further, at the cost of additional dependencies. Right now MLlib assumes HDFS and Tachyon as the checkpoint solution.

In my experiments I found distributed Spark ALS on 16 cores (8 nodes, each running 2 cores) to be around 2x slower than Jellyfish SGD running on 16 cores in shared memory. The biggest machine we had was 16 cores. Note that we factored ranges of 20M users and 3M products with ~400M ratings, so 2x was a good number for me. I also found that Spark ALS was 6-7x faster than Oryx (the scalable version of Mahout ALS).

In terms of RMSE, SGD produced results very similar to ALS in my experiments with the Netflix dataset. On new datasets I still run Jellyfish to check whether SGD produces better local minima than ALS; if it does on some dataset, that is a motivation to implement DSGD/Sparkler. I have cleaned up some Netflix-specific optimizations from Jellyfish. I also implemented a non-negative variant (using a projected-gradient algorithm similar to the one in NNLS.scala), but with SGD, implementing a projected gradient and getting good convergence turned out to be difficult. If you want, I can open source the code.

I have not run Jellyfish yet. However, their paper mentions a "3 min" run-time on the Netflix dataset. I also ran the Netflix data in MLlib ALS (rank 30, no implicit preference). ALS does not seem to improve after 5 iterations, which take 4-5 minutes. The machine has 32 cores but I allocate 16 to ALS. The ALS machine is more powerful than the one in the Jellyfish paper, though, which has 12 cores, so my result seems consistent with your 2x (Jellyfish vs. MLlib ALS) number. W.r.t. RMSE, I can never get a number <0.85, regardless of the configuration I use (I tried different ranks, lambdas, and iteration counts), whereas Jellyfish is able to achieve an RMSE <0.8. I wonder what RMSE you obtained and which ALS configuration you used.
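For concreteness, an MLlib ALS run like the one described above (explicit feedback, rank 30, a few iterations) looks roughly like the following against the 2014 MLlib API. This is only a sketch: the input path, the lambda value (0.065) and the RMSE-on-training-data shortcut are placeholders, not the exact job either of us ran.

    import org.apache.spark.SparkContext._
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Assumes an existing SparkContext `sc`; path and delimiter are placeholders.
    val ratings = sc.textFile("hdfs:///data/netflix/ratings.txt").map { line =>
      val Array(user, movie, stars) = line.split(',')
      Rating(user.toInt, movie.toInt, stars.toDouble)
    }.cache()

    // Explicit-feedback ALS: rank 30, 5 iterations; lambda = 0.065 is only an example value.
    val model = ALS.train(ratings, 30, 5, 0.065)

    // RMSE, computed on the training set here; use a held-out probe set for a fair number.
    val predicted = model.predict(ratings.map(r => (r.user, r.product)))
                         .map(p => ((p.user, p.product), p.rating))
    val rmse = math.sqrt(
      ratings.map(r => ((r.user, r.product), r.rating))
             .join(predicted)
             .map { case (_, (actual, pred)) => (actual - pred) * (actual - pred) }
             .mean())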
In general, a shared-memory implementation should "always be faster" than a MapReduce-like implementation at such a scale. Agree?

Spark ALS (and Oryx / Mahout's Taste) represent the problem as least squares. For explicit feedback no broadcast is needed, and for implicit feedback a global Gram matrix is computed with a rank x rank broadcast; if the ranks are low, this is not a big issue. Representing the problem as least squares has other benefits as well, since we can map the least-squares subproblems to a quadratic program and use the following ideas: https://issues.apache.org/jira/browse/SPARK-2426

We are doing a more thorough analysis of our Spark ALS variants against Jellyfish on shared memory. I will point you to the results when they are available.

I read through this thread. You have already done a lot of implementation work on this. Thanks for sharing, and I will spend some time reading it.

I feel it might be worthwhile to represent the factorization problem on GraphX, but that is a different direction altogether from the blocked matrix-factorization ideas. There is also some work on allowing asynchronous updates in DSGD:
http://papers.nips.cc/paper/4894-more-effective-distributed-ml-via-a-stale-synchronous-parallel-parameter-server
http://www.cs.cmu.edu/~epxing/papers/2014/Cui_etal_ATC14.pdf

But this is not straightforward to implement in Spark, unless we can make the accumulator a scalable, globally shared variable, or make an RDD updatable by block (if not by record).

Thanks.
Deb

On Sun, Aug 17, 2014 at 12:14 PM, Wei Tan <w...@us.ibm.com> wrote:

Hi Xiangrui, yes, I was not clear in my previous posting. You did the optimization at the block level (which is brilliant!) so that blocks are joined first to avoid redundant data transfer. I have two follow-up questions:

1. When you do rdd_a.join(rdd_b), at which site is the join done? Say, if sizeOf(rdd_a) >> sizeOf(rdd_b), does Spark move rdd_b to rdd_a (in a per-block manner) and do the join there?

2. For matrix factorization, there exist some distributed SGD algorithms such as http://people.cs.umass.edu/~boduo/publications/2013EDBT-sparkler.pdf. I plan to do some performance comparisons soon. Any idea which method is better?

Thanks!
Wei

---------------------------------
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan

From: Xiangrui Meng <men...@gmail.com>
To: Wei Tan/Watson/IBM@IBMUS
Cc: "user@spark.apache.org" <user@spark.apache.org>
Date: 08/04/2014 12:51 AM
Subject: Re: MLLib: implementing ALS with distributed matrix

To be precise, the optimization is not "get all products that are related to this user" but "get all products that are related to users inside this block". So a product factor won't be sent to the same block more than once.

We considered using GraphX to implement ALS, which is much easier to understand. Right now, GraphX doesn't support the optimization we use in ALS. But we are definitely looking for simplifications of MLlib's ALS code. If you have some good ideas, please create a JIRA and we can move our discussion there.

Best,
Xiangrui
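To make the block-level optimization concrete: each product block groups its factors by destination user block before shuffling, so a factor crosses the network at most once per destination block rather than once per rating. Below is a rough sketch of that idea only -- the types and names are hypothetical and this is not MLlib's actual internal code.

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import scala.collection.mutable

    // Hypothetical, simplified product block: the factor vectors it holds plus an
    // OutLink-style table recording which user blocks rated each product.
    case class ProductBlock(
        productIds: Array[Int],
        factors: Array[Array[Double]],      // index-aligned with productIds
        destUserBlocks: Array[Array[Int]])  // user blocks that need each product

    // Ship each factor at most once per destination user block (rather than once
    // per rating), then deliver one combined message per user block.
    def shipFactors(productBlocks: RDD[(Int, ProductBlock)])
      : RDD[(Int, Array[(Int, Array[Double])])] =
      productBlocks.flatMap { case (_, block) =>
        val byDest = mutable.Map.empty[Int, mutable.ArrayBuffer[(Int, Array[Double])]]
        for (i <- block.productIds.indices; userBlock <- block.destUserBlocks(i))
          byDest.getOrElseUpdate(userBlock, mutable.ArrayBuffer.empty) +=
            ((block.productIds(i), block.factors(i)))
        byDest.iterator.map { case (userBlock, fs) => (userBlock, fs.toArray) }
      }.groupByKey().mapValues(_.flatten.toArray)

Each user block can then solve its users' least-squares subproblems locally from the factors it received.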
On Sun, Aug 3, 2014 at 8:39 PM, Wei Tan <w...@us.ibm.com> wrote:
> Hi,
>
> I wrote my own centralized ALS implementation, and read the distributed
> implementation in MLlib. It uses InLink and OutLink to implement functions
> like "get all products which are related to this user", and ultimately
> achieves model distribution.
>
> If we had a distributed matrix library, the complex InLink and OutLink logic
> could be achieved relatively easily with matrix select-row or select-column
> operators. With the InLink- and OutLink-based implementation, the distributed
> code is quite different from, and more complex than, the centralized one.
>
> I have a question: could we move this complexity (InLink and OutLink) down to
> a distributed matrix-manipulation layer, leaving the upper-layer ALS algorithm
> "similar" to a centralized one? To be more specific, if we could make a
> DoubleMatrix an RDD and optimize its distributed manipulation, we could make
> the ALS algorithm easier to implement.
>
> Does that make sense?
>
> Best regards,
> Wei
>
> ---------------------------------
> Wei Tan, PhD
> Research Staff Member
> IBM T. J. Watson Research Center
> http://researcher.ibm.com/person/us-wtan
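For reference, the "centralized" per-user update described above boils down to a loop like the sketch below, written against jblas (which MLlib's ALS relied on at the time); it is an illustrative sketch, not Wei's code or MLlib's. The row selection at the top of the loop -- take the rows of P for the products this user rated -- is exactly the piece that InLink/OutLink implements in a distributed way.

    import org.jblas.{DoubleMatrix, Solve}

    // One ALS half-step on a single machine: given the full rating matrix R
    // (users x products, zeros for missing entries) and the product factors P
    // (products x rank), solve a small ridge-regression problem per user.
    def updateUserFactors(R: DoubleMatrix, P: DoubleMatrix, rank: Int, lambda: Double): DoubleMatrix = {
      val U = new DoubleMatrix(R.rows, rank)
      for (u <- 0 until R.rows) {
        // "Select the products rated by user u" -- the select-row operation that a
        // distributed matrix layer would have to provide.
        val rated = (0 until R.columns).filter(j => R.get(u, j) != 0.0).toArray
        if (rated.nonEmpty) {
          val Pu = P.getRows(rated)                              // |rated| x rank
          val ru = new DoubleMatrix(rated.map(j => R.get(u, j))) // |rated| x 1
          // Normal equations for user u: (Pu' Pu + lambda * I) x = Pu' ru
          val A = Pu.transpose().mmul(Pu).addi(DoubleMatrix.eye(rank).muli(lambda))
          val b = Pu.transpose().mmul(ru)
          U.putRow(u, Solve.solvePositive(A, b).transpose())
        }
      }
      U
    }

The product half-step is symmetric (swap the roles of R's rows and columns); the open question in this thread is how to keep this loop shape while R, U and P are partitioned across machines.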