Yep, I thought it was a bogus comparison. Let me rephrase my question, since it was poorly worded: on average, how much faster is Spark than PySpark (I didn't really mean Scala vs. Python)? I've only used Spark and don't have a chance to test this myself at the moment, so if anybody has these numbers or even general estimates (10x, etc.), that'd be great.
@Jeremy, if you can discuss this, what's an example of a project you implemented using these libraries + PySpark?

Thanks, everyone!

On Wed, Aug 13, 2014 at 1:04 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> On a related note, I recently heard about Distributed R
> <https://github.com/vertica/DistributedR>, which is coming out of
> HP/Vertica and seems to be their proposition for machine learning at
> scale.
>
> It would be interesting to see some kind of comparison between that and
> MLlib (and perhaps also SparkR
> <https://github.com/amplab-extras/SparkR-pkg>?), especially since
> Distributed R has a concept of distributed arrays and works on data
> in-memory. Docs are here
> <https://github.com/vertica/DistributedR/tree/master/doc/platform>.
>
> Nick
>
> On Wed, Aug 13, 2014 at 3:29 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> They only compared their own implementations of a couple of algorithms
>> on different platforms, rather than comparing the platforms themselves
>> (in the case of Spark -- PySpark). I can write two variants of an
>> algorithm on Spark and make them perform drastically differently.
>>
>> I have no doubt that if you implement an ML algorithm in Python itself,
>> without any native libraries, the performance will be sub-optimal.
>>
>> What PySpark really provides is:
>>
>> - Using Spark transformations in Python
>> - ML algorithms implemented in Scala (leveraging native numerical
>>   libraries for high performance), callable from Python
>>
>> The paper claims "Python is now one of the most popular languages for
>> ML-oriented programming", and that's why they went ahead with Python.
>> However, as I understand it, very few people actually implement
>> algorithms directly in Python because of the sub-optimal performance.
>> Most people implement algorithms in other languages (e.g. C / Java) and
>> expose APIs in Python for ease of use. This is what we are trying to do
>> with PySpark as well.
>>
>> On Wed, Aug 13, 2014 at 11:09 AM, Ignacio Zendejas <
>> ignacio.zendejas...@gmail.com> wrote:
>>
>> > Has anyone had a chance to look at this paper (with title in subject)?
>> > http://www.cs.rice.edu/~lp6/comparison.pdf
>> >
>> > Interesting that they chose to use Python alone. Do we know how much
>> > faster Scala is vs. Python in general, if at all?
>> >
>> > As with any and all benchmarks, I'm sure there are caveats, but it'd
>> > be nice to have a response to the question above for starters.
>> >
>> > Thanks,
>> > Ignacio
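For what it's worth, here is a minimal sketch of the split Reynold describes above: driver logic and transformations written in Python, with the actual training crossing over into MLlib's Scala/JVM implementation. The toy dataset and app name are invented for illustration, and it assumes the RDD-based pyspark.mllib API:

    from pyspark import SparkContext
    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="pyspark-mllib-sketch")  # hypothetical app name

    # A Spark transformation expressed in Python: the lambda runs in
    # Python worker processes alongside the JVM executors.
    # Toy two-feature dataset; in practice this would be loaded from storage.
    points = (sc.parallelize([(0.0, [0.0, 1.0]), (1.0, [1.0, 0.0])])
                .map(lambda p: LabeledPoint(p[0], p[1])))

    # train() crosses into the JVM: the gradient-descent loop runs in
    # Scala (backed by native numerical libraries), not in Python.
    model = LogisticRegressionWithSGD.train(points, iterations=10)
    print(model.predict([1.0, 0.0]))

    sc.stop()

The main PySpark-specific overhead in this pattern is serializing records between the Python workers and the JVM, which is exactly why keeping the numeric inner loops on the JVM side matters.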