Hi everyone,

What is the best way to migrate existing scikit-learn code to a PySpark cluster? That way we could combine the full power of scikit-learn and Spark to do scalable machine learning. (I know we have MLlib, but our existing code base is large, and some functions are not fully supported there yet.)
Currently I use Python's multiprocessing module to speed things up, but that only works on a single node and only while the data set is small. In many real cases we need to handle gigabytes or even terabytes of data, with thousands of raw categorical attributes that can expand to millions of discrete features under a 1-of-k (one-hot) representation. For those cases, one solution is distributed memory, which is why I am considering Spark, and Spark supports Python. With PySpark we can import scikit-learn, but the question is how to get the scikit-learn code, a DecisionTreeClassifier for example, running in distributed mode so that it actually benefits from Spark. (A rough sketch of the closest I've gotten so far is below my sign-off.)

Best,
Rex
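P.S. Here is a minimal sketch of what I mean, assuming a local SparkContext and random placeholder data (the names and data here are just for illustration, not from our real pipeline). The model is still fitted on the driver with plain scikit-learn; only prediction is spread across partitions by broadcasting the fitted model. The distributed-training part is exactly the piece I don't know how to do.

    import numpy as np
    from pyspark import SparkContext
    from sklearn.tree import DecisionTreeClassifier

    sc = SparkContext(appName="sklearn-on-spark-sketch")

    # Fit on a small local sample on the driver -- NOT distributed.
    X_local = np.random.rand(1000, 20)
    y_local = np.random.randint(0, 2, size=1000)
    clf = DecisionTreeClassifier().fit(X_local, y_local)

    # Ship the fitted model to every executor once.
    clf_bc = sc.broadcast(clf)

    def predict_partition(rows):
        """Score one partition of feature vectors with the broadcast model."""
        rows = list(rows)
        if not rows:
            return iter([])
        X = np.array(rows)
        return iter(clf_bc.value.predict(X).tolist())

    # In reality this RDD would be parsed from large input files;
    # random vectors stand in for it here.
    features_rdd = sc.parallelize(np.random.rand(100000, 20).tolist(), numSlices=8)
    predictions = features_rdd.mapPartitions(predict_partition)
    print(predictions.take(5))

So scoring scales out fine, but fitting the tree is still confined to one machine's memory, which is the bottleneck I'd like to remove.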