I should point out that I'm not sure what the performance of that project is.
I'd expect that the native DataFrame in PySpark will be significantly more efficient than their DictRDD.

It would be interesting to see a performance comparison of their pipelines against native Spark ML pipelines, if you do test both out.

--
Sent from Mailbox

On Sat, Sep 12, 2015 at 10:52 PM, Rex X <[email protected]> wrote:

> Jorn and Nick,
>
> Thanks for answering.
>
> Nick, the sparkit-learn project looks interesting. Thanks for mentioning it.
>
> Rex
>
> On Sat, Sep 12, 2015 at 12:05 PM, Nick Pentreath <[email protected]>
> wrote:
>
>> You might want to check out https://github.com/lensacom/sparkit-learn
>> <https://github.com/lensacom/sparkit-learn/blob/master/README.rst>
>>
>> Though it's true that for random forests / trees you will need to use
>> MLlib.
>>
>> --
>> Sent from Mailbox <https://www.dropbox.com/mailbox>
>>
>> On Sat, Sep 12, 2015 at 9:00 PM, Jörn Franke <[email protected]> wrote:
>>
>>> I fear you have to do all the plumbing yourself. This is the same for all
>>> commercial and non-commercial libraries/analytics packages. How you
>>> distribute also often depends on the functional requirements.
>>>
>>> On Sat, Sep 12, 2015 at 8:18 PM, Rex X <[email protected]> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> What is the best way to migrate existing scikit-learn code to a PySpark
>>>> cluster? Then we can bring together the full power of both scikit-learn
>>>> and Spark, to do scalable machine learning. (I know we have MLlib. But the
>>>> existing code base is big, and some functions are not fully supported yet.)
>>>>
>>>> Currently I use Python's multiprocessing module to boost the speed. But
>>>> this only works on one node, while the data set is small.
>>>>
>>>> For many real cases, we may need to deal with gigabytes or even
>>>> terabytes of data, with thousands of raw categorical attributes, which can
>>>> lead to millions of discrete features when using a 1-of-k representation.
>>>>
>>>> For these cases, one solution is to use distributed memory. That's why I
>>>> am considering Spark. And Spark supports Python!
>>>> With PySpark, we can import scikit-learn.
>>>>
>>>> But the question is how to make the scikit-learn code, a decision tree
>>>> classifier for example, run in distributed computing mode, to benefit
>>>> from the power of Spark?
>>>>
>>>> Best,
>>>> Rex
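
For reference, the "plumbing" discussed in this thread could look roughly like the sketch below. It is only an illustration under stated assumptions. It is not sparkit-learn's API, and it is not MLlib's distributed decision tree. Each partition fits its own scikit-learn DecisionTreeClassifier through RDD.mapPartitions, and predictions are combined by majority vote, so the result is an ensemble of trees trained on shards of the data rather than one tree trained on all of it. The names train_on_partition, majority_vote and train_on_spark are hypothetical, and `sc` is assumed to be an existing SparkContext. The last few lines simulate two partitions locally so the sketch can be tried without a cluster.

```python
from collections import Counter

import numpy as np
from sklearn.tree import DecisionTreeClassifier


def train_on_partition(rows):
    """Fit one scikit-learn tree on the rows of a single partition.

    `rows` is an iterator of (features, label) pairs, which is the shape
    that RDD.mapPartitions passes to its function.
    """
    rows = list(rows)
    if not rows:
        return  # empty partition: yield no model
    X = np.array([features for features, _ in rows])
    y = np.array([label for _, label in rows])
    yield DecisionTreeClassifier(random_state=0).fit(X, y)


def majority_vote(models, X):
    """Combine the per-partition trees by simple majority vote."""
    votes = np.array([m.predict(X) for m in models])  # (n_models, n_samples)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])


def train_on_spark(sc, data, num_partitions=4):
    """Hypothetical driver: `sc` is assumed to be a live SparkContext."""
    rdd = sc.parallelize(data, num_partitions)
    return rdd.mapPartitions(train_on_partition).collect()


# Local stand-in for two partitions, so the sketch runs without Spark.
data = [([0.0, 0.0], 0), ([0.1, 0.2], 0), ([1.0, 1.0], 1), ([0.9, 0.8], 1)] * 5
partitions = [data[:10], data[10:]]
models = [m for part in partitions for m in train_on_partition(iter(part))]
predictions = majority_vote(models, np.array([[0.05, 0.1], [0.95, 0.9]]))
```

On a real cluster, train_on_spark(sc, data) would replace the local loop. Whether this is acceptable depends on the functional requirements, as noted above: each tree only sees its own partition, so accuracy can differ from a single model trained on the full data set.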
