Thought that Spark users may be interested in the outcome of the Spark / scikit-learn sprint that happened last month just after Strata...
---------- Forwarded message ----------
From: Olivier Grisel <olivier.gri...@ensta.org>
Date: Fri, Feb 21, 2014 at 6:30 PM
Subject: Re: [Scikit-learn-general] Spark+sklearn sprint outcome ?
To: scikit-learn-general <scikit-learn-gene...@lists.sourceforge.net>

2014-02-21 16:06 GMT+01:00 Eustache DIEMERT <eusta...@diemert.fr>:
> Hi there,
>
> Could someone who attended the sprint send a rough summary?
>
> I'd be particularly interested in the approaches that were tested: those
> that didn't work, those that seem promising, and what the next steps
> could be...

We started with a general discussion on PySpark. It naturally features a
Python wrapper for mllib, the Scala distributed machine learning library [1]
that is optimized to work on Spark. However, Python users might still want
to leverage existing numpy / scipy tools for some workloads.

The main difficulty in using numpy-aware tools efficiently is that Spark
presents the data to the workers as an iterator over a large, possibly
cluster-partitioned, collection of elements called an RDD. Used naively,
one would load individual rows (1D numpy arrays) as the elements of an RDD
to represent the content of a 2D data matrix. This is inefficient, both
because of the communication overhead between the Scala and Python workers
and because it prevents efficient BLAS operations that involve several rows
at a time, such as DGEMM calls via numpy.dot.

This first issue was tackled by writing a block_rdd helper function [2]
that concatenates bunches of rows (e.g. 1D numpy arrays or lists of Python
dicts) into chunked 2D numpy arrays or pandas DataFrames respectively. This
makes it possible to train linear models incrementally and more
efficiently, as done in [3]; model averaging is done via a reduction step.

We also discussed how we could make it easier to plot the distribution of
data stored in an RDD, and came up with the idea of computing histograms on
the Spark side while exposing them through the same API as the
numpy.histogram function [4].

Have a look at the tests [5] for basic usage examples of all of the above.
There is also some high-level discussion of the scope of the project in
[6]. Rough sketches of the blocking, model-averaging and histogram ideas
are appended as a postscript below, after the references.

[1] http://spark.incubator.apache.org/docs/latest/mllib-guide.html
[2] https://github.com/ogrisel/spylearn/blob/master/spylearn/block_rdd.py
[3] https://github.com/ogrisel/spylearn/blob/master/spylearn/linear_model.py
[4] https://github.com/ogrisel/spylearn/blob/master/spylearn/histogram.py
[5] https://github.com/ogrisel/spylearn/tree/master/test
[6] https://github.com/ogrisel/spylearn/issues/1

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
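PS: for the curious, here are rough, untested sketches of the three ideas
above, written against the plain PySpark API. First, the row-blocking idea
behind block_rdd [2]. Only the name and the goal (turning an RDD of 1D rows
into an RDD of 2D numpy blocks) come from the sprint work; the body and the
block_size parameter below are illustrative assumptions, not the actual
spylearn code:

    import numpy as np

    def block_rdd(rdd, block_size=1000):
        """Group the 1D rows of each partition into 2D numpy blocks."""
        def to_blocks(iterator):
            rows = []
            for row in iterator:
                rows.append(row)
                if len(rows) == block_size:
                    # a 2D block amortizes the Python <-> Scala overhead
                    # and enables vectorized BLAS calls such as numpy.dot
                    yield np.vstack(rows)
                    rows = []
            if rows:
                yield np.vstack(rows)  # trailing, possibly smaller block
        return rdd.mapPartitions(to_blocks)

For instance, block_rdd(sc.parallelize(rows), block_size=100) yields
elements that are 2D arrays of up to 100 rows each, one stream of blocks
per partition.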
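Second, the incremental training / model-averaging scheme of [3]. The
assumptions in this sketch are mine: the blocked RDD holds (X, y) pairs,
each partition fits a scikit-learn SGDClassifier with partial_fit, and the
reduction averages parameters with equal weight per partition (the real
spylearn code may weight or combine the models differently):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    def fit_averaged_sgd(blocked_Xy, classes):
        """Fit one SGDClassifier per partition, then average them."""
        classes = np.asarray(classes)

        def fit_partition(iterator):
            clf = SGDClassifier()
            seen = False
            for X, y in iterator:
                # one block at a time: memory stays bounded by block size
                clf.partial_fit(X, y, classes=classes)
                seen = True
            if seen:  # skip empty partitions
                yield (clf.coef_, clf.intercept_, 1)

        # reduction step: sum the per-partition parameters and count them
        coef, intercept, n = blocked_Xy.mapPartitions(fit_partition).reduce(
            lambda a, b: (a[0] + b[0], a[1] + b[1], a[2] + b[2]))

        # rebuild a single estimator carrying the averaged parameters
        avg = SGDClassifier()
        avg.classes_ = classes
        avg.coef_ = coef / n
        avg.intercept_ = intercept / n
        return avg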
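Finally, the numpy.histogram-compatible distributed histogram of [4]: bin
each partition against a shared set of edges and sum the counts in a
reduce step, returning the same (counts, bin_edges) pair as
numpy.histogram. Requiring an explicit range so that all partitions agree
on the edges is a simplification of this sketch; the spylearn version may
infer it from the data:

    import numpy as np

    def histogram(rdd, bins=10, range=(0.0, 1.0)):
        """numpy.histogram-style API over an RDD of scalars."""
        # fix the bin edges up front so every partition bins identically
        # (`range` shadows the builtin to mirror numpy.histogram's API)
        edges = np.linspace(range[0], range[1], bins + 1)

        def hist_partition(iterator):
            values = np.fromiter(iterator, dtype=float)
            counts, _ = np.histogram(values, bins=edges)
            yield counts

        # summing per-partition counts is exact: histograms are additive
        counts = rdd.mapPartitions(hist_partition).reduce(lambda a, b: a + b)
        return counts, edges

The resulting (counts, edges) pair can then be fed to matplotlib (e.g. a
bar or step plot) on the driver without ever collecting the raw data.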