BTW you can find the original Presto (rebranded as Distributed R) paper here: http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Venkataraman.pdf
On Wed, Aug 13, 2014 at 2:16 PM, Reynold Xin <r...@databricks.com> wrote:

> Actually, I believe the same person started both projects.
>
> The Distributed R project at HP was started by Shivaram Venkataraman when
> he was there. He has since moved to the Berkeley AMPLab to pursue a PhD,
> and SparkR is his latest project.
>
>
> On Wed, Aug 13, 2014 at 1:04 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> On a related note, I recently heard about Distributed R
>> <https://github.com/vertica/DistributedR>, which is coming out of
>> HP/Vertica and seems to be their proposition for machine learning at
>> scale.
>>
>> It would be interesting to see some kind of comparison between that and
>> MLlib (and perhaps also SparkR
>> <https://github.com/amplab-extras/SparkR-pkg>?), especially since
>> Distributed R has a concept of distributed arrays and operates on
>> in-memory data. Docs are here
>> <https://github.com/vertica/DistributedR/tree/master/doc/platform>.
>>
>> Nick
>>
>>
>> On Wed, Aug 13, 2014 at 3:29 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> They only compared their own implementations of a couple of algorithms
>>> on different platforms, rather than comparing the platforms themselves
>>> (in the case of Spark -- PySpark). I could write two variants of an
>>> algorithm on Spark and make them perform drastically differently.
>>>
>>> I have no doubt that if you implement an ML algorithm in pure Python,
>>> without any native libraries, the performance will be suboptimal.
>>>
>>> What PySpark really provides is:
>>>
>>> - Spark transformations usable from Python
>>> - ML algorithms implemented in Scala (leveraging native numerical
>>>   libraries for high performance) and callable from Python
>>>
>>> The paper claims "Python is now one of the most popular languages for
>>> ML-oriented programming", and that's why they went ahead with Python.
>>> However, as I understand it, very few people actually implement
>>> algorithms directly in Python because of the suboptimal performance.
>>> Most people implement algorithms in other languages (e.g. C / Java) and
>>> expose APIs in Python for ease of use. This is what we are trying to do
>>> with PySpark as well.
>>>
>>>
>>> On Wed, Aug 13, 2014 at 11:09 AM, Ignacio Zendejas <
>>> ignacio.zendejas...@gmail.com> wrote:
>>>
>>> > Has anyone had a chance to look at this paper (title in the subject
>>> > line)? http://www.cs.rice.edu/~lp6/comparison.pdf
>>> >
>>> > Interesting that they chose to use Python alone. Do we know how much
>>> > faster Scala is than Python in general, if at all?
>>> >
>>> > As with any benchmark, I'm sure there are caveats, but it would be
>>> > nice to have a response to the question above for starters.
>>> >
>>> > Thanks,
>>> > Ignacio
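
For anyone following along, here is a minimal sketch of the division of
labor Reynold describes above: transformations are written in Python, while
the algorithm itself runs in MLlib's Scala implementation and is merely
invoked from Python. This uses the RDD-based MLlib API; the input path
data.csv, its "label,f1,f2,..." format, and the parse_point helper are
hypothetical, not from the thread.

    # Sketch: Python handles the data transformations, the JVM does the math.
    from pyspark import SparkContext
    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="pyspark-mllib-sketch")

    # Python-side transformation: parse "label,f1,f2,..." lines into
    # LabeledPoints. This closure is shipped to workers and run in Python.
    def parse_point(line):
        values = [float(x) for x in line.split(",")]
        return LabeledPoint(values[0], values[1:])

    points = sc.textFile("data.csv").map(parse_point)  # hypothetical input

    # train() hands the RDD to the Scala/JVM implementation of logistic
    # regression; Python never executes the numerical inner loop.
    model = LogisticRegressionWithSGD.train(points, iterations=100)
    print(model.weights)

    sc.stop()

The takeaway is the same point Reynold makes: the Python layer is an API
for ease of use, while the performance-critical training loop stays in
Scala on the JVM.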