Hi Deb, thanks for sharing your result. Please find my comments inline in 
blue.
Best regards,
Wei




From:   Debasish Das <debasish.da...@gmail.com>
To:     Wei Tan/Watson/IBM@IBMUS, 
Cc:     Xiangrui Meng <men...@gmail.com>, "user@spark.apache.org" 
<user@spark.apache.org>
Date:   08/17/2014 08:15 PM
Subject:        Re: MLLib: implementing ALS with distributed matrix



Hi Wei,

Sparkler code was not available for benchmarking, so I picked up Jellyfish, 
which uses SGD. If you look at the paper, the ideas are very similar to the 
Sparkler paper, but Jellyfish runs on shared memory and uses C code, while 
Sparkler was built on top of Spark. Jellyfish also uses some interesting 
regularization such as L1-ball projection, and in fact the paper reported 
the best RMSE on the Netflix dataset using these ideas.

Jellyfish was a good runtime baseline to compare against. Of course, the 
issue with shared memory is scalability.

I know the authors of Sparkler and I may be able to get the code. The 
issue is that it needs a change to Spark by adding the CM (Carousel Map), 
which makes the approach less appealing. For performance numbers, it seems 
that their run-time on a 40-machine cluster is similar to what I got using 
MLlib ALS on a single 32-core machine. But this comparison is very rough 
and not very scientific -- just for your information.

For a very brief back-of-the-envelope comparison (and correct me if I am 
wrong), it seems that Sparkler does not need to keep the rating matrix R 
in two layouts, R and its transpose R'. It also involves less data movement 
in each epoch, since each product block only needs to move once (as a 
whole) to one destination user block, whereas in ALS each product block 
needs to be split and moved to many user blocks. So I wonder why it does 
not offer better performance than ALS.
 
Sparkler was written before the ALS version that we have right now in MLlib 
existed. The blocking idea used in Sparkler is not very different from what 
we have in ALS right now. But Sparkler used a faster checkpointing method 
(they call it a Carousel Map) which is equally applicable to MLlib ALS. 
Right now we checkpoint to HDFS, but an efficient hot distributed cache 
(like memcached, Redis or Cassandra) would help improve the runtime 
further, at the cost of additional dependencies. Right now MLlib assumes 
HDFS and Tachyon as the checkpoint solutions.
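
As a concrete (and minimal) sketch of the HDFS checkpointing path, assuming 
a Spark 1.x-style driver; the HDFS paths, rating file format and ALS 
parameters below are illustrative only, not what we actually run:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object AlsWithHdfsCheckpoint {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("als-checkpoint-sketch"))

    // Truncate the long lineage of iterative ALS by checkpointing to HDFS.
    // Any reliable storage visible to all executors would do.
    sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")

    // Assume "user::product::rating" lines; path and format are illustrative.
    val ratings = sc.textFile("hdfs:///data/ratings").map { line =>
      val Array(u, p, r) = line.split("::")
      Rating(u.toInt, p.toInt, r.toDouble)
    }

    // rank = 30, 10 iterations, lambda = 0.065 -- example values only.
    val model = ALS.train(ratings, 30, 10, 0.065)
    model.userFeatures.count()   // force materialization
    sc.stop()
  }
}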

In my experiments I found distributed Spark ALS on 16 cores (8 nodes, each 
running 2 cores) to be around 2x slower than Jellyfish SGD also running on 
16 cores in shared memory. The biggest machine we had was 16 cores. Note 
that we factored roughly 20M users and 3M products with ~400M ratings. 2x 
was a good number for me. I also found that Spark ALS was 6-7x faster than 
Oryx (a scalable version of Mahout ALS).

In terms of RMSE, SGD produced results very similar to ALS in my 
experiments with the Netflix dataset. On new datasets, I still run 
Jellyfish to check whether SGD finds better local minima than ALS on some 
dataset; that would be a motivation to implement DSGD/Sparkler. I have 
cleaned up some Netflix-specific optimizations from Jellyfish. I also 
implemented a non-negative variant (using a projected gradient algorithm 
similar to the one in NNLS.scala), but with SGD, implementing a projected 
gradient and getting good convergence turned out to be difficult. If you 
want I can open source the code.
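
To show what I mean by the projected gradient for non-negativity, here is a 
minimal, self-contained sketch of one SGD update followed by clipping onto 
the non-negative orthant; the step size, regularization and factor layout 
are illustrative, and this is not Jellyfish's actual code:

object ProjectedSgdStep {
  // One SGD update for a single (user, product, rating) triple, followed by
  // projection onto the non-negative orthant (clip negative entries to zero).
  def update(userF: Array[Double], prodF: Array[Double], rating: Double,
             stepSize: Double, lambda: Double): Unit = {
    val rank = userF.length
    var pred = 0.0
    var i = 0
    while (i < rank) { pred += userF(i) * prodF(i); i += 1 }
    val err = rating - pred

    // Gradient step with L2 regularization, then project back to x >= 0.
    i = 0
    while (i < rank) {
      val u = userF(i); val p = prodF(i)
      userF(i) = math.max(0.0, u + stepSize * (err * p - lambda * u))
      prodF(i) = math.max(0.0, p + stepSize * (err * u - lambda * p))
      i += 1
    }
  }

  def main(args: Array[String]): Unit = {
    val u = Array(0.1, 0.2, 0.3)
    val p = Array(0.3, 0.1, 0.2)
    update(u, p, rating = 4.0, stepSize = 0.01, lambda = 0.05)
    println(u.mkString(", "))
  }
}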

I have not run JellyFish yet. However, their paper mentioned a "3 min" 
run-time on the Netflix data set. I also ran the Netflix data with MLlib 
ALS (rank 30, no implicit preference). ALS does not seem to improve after 
5 iterations, which take 4-5 minutes. The machine has 32 cores but I 
allocate 16 to ALS. On the other hand, the ALS machine is more powerful 
than the one in JellyFish's paper, which has 12 cores. So my result seems 
to be consistent with your 2x (JellyFish vs. MLlib ALS) number.

W.r.t. RMSE, I can never get a number below 0.85, regardless of which 
configuration I use (I tried different ranks, lambdas, and iteration 
counts). However, JellyFish is able to achieve an RMSE below 0.8. I wonder 
what RMSE you obtained and what ALS configuration you used.
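
For reference, this is roughly how I compute the RMSE, as a sketch; the 
80/20 split and the ALS parameters below are illustrative rather than my 
exact configuration:

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

object AlsRmse {
  // Train ALS on a random split of the ratings and return the test RMSE.
  def trainAndEvaluate(ratings: RDD[Rating]): Double = {
    val Array(train, test) = ratings.randomSplit(Array(0.8, 0.2), seed = 42L)

    // rank = 30, 10 iterations, lambda = 0.065 -- the knobs under discussion.
    val model: MatrixFactorizationModel = ALS.train(train, 30, 10, 0.065)

    // Join predictions against held-out ratings keyed by (user, product).
    val predictions = model
      .predict(test.map(r => (r.user, r.product)))
      .map(r => ((r.user, r.product), r.rating))
    val squaredErrors = test
      .map(r => ((r.user, r.product), r.rating))
      .join(predictions)
      .map { case (_, (actual, predicted)) => val d = actual - predicted; d * d }
    math.sqrt(squaredErrors.mean())
  }
}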

In general, a shared-memory implementation should "always" be faster than a 
MapReduce-like implementation at this scale. Agree?
 
Spark ALS (and Oryx/Mahout's Taste) represent the problem as least squares. 
For explicit feedback no broadcast is needed, and for implicit feedback a 
global Gram matrix is computed with a rank x rank broadcast; if ranks are 
low, this is not a big issue. Representing the problem as least squares has 
other benefits as well, since we can map least squares to a quadratic 
program and use the following ideas:

https://issues.apache.org/jira/browse/SPARK-2426
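
To spell out the least-squares subproblem: each user factor is obtained by 
solving a small rank x rank linear system built from the factors of the 
products that user rated. Below is a minimal sketch, using Breeze for the 
local linear algebra; the toy data and the plain-lambda regularization are 
illustrative, not MLlib's exact formulation:

import breeze.linalg.{DenseMatrix, DenseVector}

object AlsNormalEquation {
  // Solve one user's factor x_u from the factors of the rated products
  // (rows of Y) and the ratings r:  x_u = (Y^T Y + lambda I)^-1 Y^T r
  def solveUser(Y: DenseMatrix[Double], r: DenseVector[Double],
                lambda: Double): DenseVector[Double] = {
    val gram = Y.t * Y + DenseMatrix.eye[Double](Y.cols) * lambda
    gram \ (Y.t * r)   // a rank x rank solve per user
  }

  def main(args: Array[String]): Unit = {
    // Toy example: one user rated 3 products, rank 2.
    val Y = DenseMatrix((0.5, 0.1), (0.2, 0.4), (0.3, 0.3))
    val r = DenseVector(4.0, 3.0, 5.0)
    println(solveUser(Y, r, lambda = 0.065))
  }
}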

We are doing a more thorough analysis comparing our Spark ALS variants with 
Jellyfish on shared memory. I will point you to the results when they are 
available.

I read through this thread. You have already done a lot of implementation 
work on this. Thanks for sharing, and I will spend some time reading it.

I feel it might be worthwhile to represent the factorization problem on 
GraphX, but that's a different direction altogether from the blocked matrix 
factorization ideas.

There is also some work on allowing asynchronous updates in DSGD:

http://papers.nips.cc/paper/4894-more-effective-distributed-ml-via-a-stale-synchronous-parallel-parameter-server
http://www.cs.cmu.edu/~epxing/papers/2014/Cui_etal_ATC14.pdf

But this is not straightforward to implement in Spark, unless we can make 
accumulators scale as globally shared variables, or make RDDs updatable by 
block (if not by record).




Thanks.
Deb

On Sun, Aug 17, 2014 at 12:14 PM, Wei Tan <w...@us.ibm.com> wrote:
Hi Xiangrui, yes, I was not clear in my previous posting. You did the 
optimization at the block level (which is brilliant!) so that blocks are 
joined first to avoid redundant data transfer. 

I have two follow-up questions: 
1. When you do rdd_a.join(rdd_b), on which site will the join be done? 
Say, if sizeOf(rdd_a) >> sizeOf(rdd_b), does Spark move rdd_b to rdd_a (in 
a per-block manner) and do the join there? (See the sketch after this 
list.) 
2. For matrix factorization, there exist some distributed SGD algorithms 
such as: 
http://people.cs.umass.edu/~boduo/publications/2013EDBT-sparkler.pdf . I 
plan to do some performance comparison soon. Any idea on which method is 
better? 
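
To make question 1 concrete, here is a sketch of the case I have in mind, 
under the assumption that pre-partitioning both RDDs with the same 
partitioner lets the join reuse that partitioning instead of shuffling the 
larger side; the data and partition count are made up:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CoPartitionedJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("join-locality-sketch"))
    val partitioner = new HashPartitioner(8)

    // rdd_a is the "large" side, rdd_b the "small" side (toy data here).
    val rddA = sc.parallelize((1 to 1000000).map(i => (i % 1000, i)))
      .partitionBy(partitioner)
    val rddB = sc.parallelize((0 until 1000).map(k => (k, "meta-" + k)))
      .partitionBy(partitioner)

    // Both sides share the same partitioner, so keys are already co-located
    // and the join can avoid reshuffling the large side.
    println(rddA.join(rddB).count())
    sc.stop()
  }
}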
Thanks! 

Wei 

--------------------------------- 
Wei Tan, PhD 
Research Staff Member 
IBM T. J. Watson Research Center 
http://researcher.ibm.com/person/us-wtan 



From:        Xiangrui Meng <men...@gmail.com> 
To:        Wei Tan/Watson/IBM@IBMUS, 
Cc:        "user@spark.apache.org" <user@spark.apache.org> 
Date:        08/04/2014 12:51 AM 
Subject:        Re: MLLib: implementing ALS with distributed matrix 




To be precise, the optimization is not `get all products that are
related to this user` but `get all products that are related to users
inside this block`. So a product factor won't be sent to the same
block more than once. We considered using GraphX to implement ALS,
which is much easier to understand. Right now, GraphX doesn't support
the optimization we use in ALS. But we are definitely looking for
simplification of MLlib's ALS code. If you have some good ideas,
please create a JIRA and we can move our discussion there.
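
To make the block-level routing concrete, here is a toy, non-Spark sketch 
(names and data are made up, not MLlib's actual OutLink code): for each 
product we record the set of user blocks that reference it, so its factor 
is shipped to each destination block at most once rather than once per 
rating.

object BlockRoutingSketch {
  def main(args: Array[String]): Unit = {
    // (userBlock, user, product) triples; toy data only.
    val ratings = Seq((0, 1, 10), (0, 2, 10), (0, 3, 11), (1, 4, 10))

    // For each product, the set of user blocks that reference it.
    val destinations: Map[Int, Set[Int]] =
      ratings.groupBy(_._3).mapValues(_.map(_._1).toSet).toMap

    // Product 10 is rated by users 1 and 2 in block 0 and user 4 in block 1,
    // yet its factor goes to each block only once.
    destinations.foreach { case (product, blocks) =>
      println("product " + product + " -> blocks " + blocks.mkString(","))
    }
  }
}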

Best,
Xiangrui

On Sun, Aug 3, 2014 at 8:39 PM, Wei Tan <w...@us.ibm.com> wrote:
> Hi,
>
>   I wrote my centralized ALS implementation, and read the distributed
> implementation in MLlib. It uses InLink and OutLink to implement functions
> like "get all products which are related to this user", and ultimately
> achieves model distribution.
>
>   If we have a distributed matrix lib, the complex InLink and OutLink logic
> can be relatively easily achieved with matrix select-row or select-column
> operators. With this InLink and OutLink based implementation, the
> distributed code is quite different and more complex than the centralized
> one.
>
>   I have a question: could we move this complexity (InLink and OutLink) to a
> lower distributed matrix manipulation layer, leaving the upper-layer ALS
> algorithm "similar" to a centralized one? To be more specific, if we can
> make a DoubleMatrix an RDD and optimize its distributed manipulation, we
> can make the ALS algorithm easier to implement.
>
>   Does it make any sense?
>
>   Best regards,
> Wei
>
> ---------------------------------
> Wei Tan, PhD
> Research Staff Member
> IBM T. J. Watson Research Center
> http://researcher.ibm.com/person/us-wtan
