Just as a note on this paper: apart from implementing the algorithms in naive 
Python, they also run them in a fairly inefficient way. In particular, their 
implementations ship the model out with every task closure, which is really 
expensive for a large model, and bring it back with collectAsMap(). It would be 
much more efficient to send it with e.g. SparkContext.broadcast(), or to keep it 
distributed on the cluster throughout the computation, instead of making the 
driver node a bottleneck for communication.

Implementing ML algorithms well by hand is unfortunately difficult, and this is 
why we have MLlib. The hope is that you either get your desired algorithm out 
of the box or get a higher-level primitive (e.g. stochastic gradient descent) 
that you can plug some functions into, without worrying about the communication.
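As a rough sketch of what "plugging functions into a higher-level primitive" means (this is an illustration in plain Python, not MLlib's actual API; the function names are made up):

```python
import random

def sgd(gradient, data, w0, lr=0.05, epochs=500, seed=0):
    """Generic SGD driver: the user supplies only a per-example gradient
    function gradient(w, x, y); the driver owns the update loop."""
    rng = random.Random(seed)
    w = list(w0)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            g = gradient(w, x, y)
            w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

# Plugged-in function: squared-loss gradient for y ~ w[0] + w[1] * x.
def lsq_grad(w, x, y):
    err = (w[0] + w[1] * x) - y
    return [2 * err, 2 * err * x]

# Noiseless data generated from y = 3 + 2x, so SGD should recover [3, 2].
data = [(x, 3.0 + 2.0 * x) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]
w = sgd(lsq_grad, data, [0.0, 0.0])
```

In a distributed setting the same separation applies: the library's SGD primitive handles partitioning and communication, and the user only writes the gradient.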

Matei

On August 13, 2014 at 11:10:02 AM, Ignacio Zendejas 
(ignacio.zendejas...@gmail.com) wrote:

Has anyone had a chance to look at this paper (with title in subject)? 
http://www.cs.rice.edu/~lp6/comparison.pdf 

Interesting that they chose to use Python alone. Do we know how much faster 
Scala is vs. Python in general, if at all? 

As with any and all benchmarks, I'm sure there are caveats, but it'd be 
nice to have a response to the question above for starters. 

Thanks, 
Ignacio 
