Hi everyone,

What is the best way to migrate existing scikit-learn code to a PySpark
cluster? Then we could combine the full power of both scikit-learn and
Spark to do scalable machine learning. (I know we have MLlib, but the
existing code base is big, and some functions are not fully supported yet.)

Currently I use Python's multiprocessing module to speed things up, but
this only works on a single node, and only while the data set is small.
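Roughly, what I do today looks like the sketch below: independent scikit-learn
fits spread over local CPU cores (make_classification is just a stand-in for the
real data loading).

    # Rough single-node sketch: fit one DecisionTreeClassifier per parameter
    # setting, in parallel across local cores with multiprocessing.
    # make_classification stands in for the real data loading.
    from multiprocessing import Pool

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    def fit_tree(max_depth):
        X, y = make_classification(n_samples=10000, n_features=50, random_state=0)
        clf = DecisionTreeClassifier(max_depth=max_depth)
        clf.fit(X, y)
        return max_depth, clf.score(X, y)

    if __name__ == "__main__":
        with Pool(processes=4) as pool:
            print(pool.map(fit_tree, [3, 5, 10, None]))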

For many real cases, we may need to deal with gigabytes or even terabytes
of data, with thousands of raw categorical attributes, which can lead to
millions of discrete features under a 1-of-k (one-hot) representation.
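To make that blow-up concrete, here is a toy 1-of-k example with scikit-learn's
OneHotEncoder (string categories work in recent scikit-learn versions):

    # Toy illustration of the 1-of-k (one-hot) blow-up: three categorical
    # columns already expand to eight binary features; thousands of
    # high-cardinality columns expand to millions.
    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    X = np.array([
        ["red",   "NYC", "chrome"],
        ["blue",  "SF",  "firefox"],
        ["green", "NYC", "safari"],
    ])
    enc = OneHotEncoder()
    print(enc.fit_transform(X).shape)  # (3, 8): 3 + 2 + 3 one-hot columns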

For these cases, one solution is to use distributed memory. That's why I am
considering Spark, and Spark supports Python!
With PySpark, we can still import scikit-learn on the workers.

But the question is how to make the scikit-learn code, a DecisionTreeClassifier
for example, run in distributed mode so that it actually benefits from the
power of Spark.
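The closest pattern I have found so far is only a sketch, not a real distributed
tree: broadcast the training data to the workers and use Spark just to
parallelize many independent scikit-learn fits, e.g. a parameter sweep
(make_classification again stands in for the real data).

    # Sketch: Spark parallelizes the parameter sweep; each worker runs plain
    # scikit-learn on the broadcast training set.
    from pyspark import SparkContext
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    sc = SparkContext(appName="sklearn-on-spark-sketch")

    X, y = make_classification(n_samples=100000, n_features=50, random_state=0)
    bc_X, bc_y = sc.broadcast(X), sc.broadcast(y)

    def fit_and_score(max_depth):
        clf = DecisionTreeClassifier(max_depth=max_depth)
        clf.fit(bc_X.value, bc_y.value)
        return max_depth, clf.score(bc_X.value, bc_y.value)

    results = sc.parallelize([3, 5, 10, None]).map(fit_and_score).collect()
    print(results)

But this only parallelizes the sweep; the training data still has to fit in
memory on every node, which is exactly what breaks down in the terabyte case.
Is there a better way?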


Best,
Rex
