Maybe this is helpful https://github.com/lensacom/sparkit-learn/blob/master/README.rst
-------- Original message --------
From: Mustafa Elbehery <elbeherymust...@gmail.com>
Date: 12/06/2015 3:59 PM (GMT-05:00)
To: user <user@spark.apache.org>
Subject: PySpark RDD with NumpyArray Structure

Hi All,

I would like to parallelize a Python NumPy array so that I can apply a scikit-learn algorithm on top of Spark. When I call sc.parallelize(), I receive an RDD with a different structure. To be more precise, I am trying to have the following:

X = [[ 0.49426097  1.45106697]
     [-1.42808099 -0.83706377]
     [ 0.33855918  1.03875871]
     ...,
     [-0.05713876 -0.90926105]
     [-1.16939407  0.03959692]
     [ 0.26322951 -0.92649949]]

However, what I get when I call sc.parallelize(X) is the following:

[array([ 0.49426097,  1.45106697]), array([-1.42808099, -0.83706377]), ...]

Anyone tried this before?
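For what it's worth, the behavior described above is expected rather than a bug: sc.parallelize() iterates over its argument, and iterating over a 2-D NumPy array yields its 1-D rows, so the RDD's elements are per-row arrays. A minimal sketch in plain NumPy (no Spark needed, values are illustrative) shows the same splitting and how np.vstack puts the matrix back together after a collect():

```python
import numpy as np

# A small 2-D array standing in for X (illustrative values).
X = np.array([[ 0.49426097,  1.45106697],
              [-1.42808099, -0.83706377],
              [ 0.33855918,  1.03875871]])

# sc.parallelize(X) iterates over X's first axis, so the RDD elements
# are the individual 1-D row arrays -- the same thing list(X) produces:
rows = list(X)
print(rows[0])  # a 1-D array such as array([0.49426097, 1.45106697])

# After rdd.collect(), the original matrix can be rebuilt with np.vstack:
restored = np.vstack(rows)
print(np.array_equal(restored, X))  # True
```

On a real cluster you can't collect everything back to the driver just to fit a model; a common pattern (my suggestion, not from the thread) is to stack each partition's rows locally inside rdd.mapPartitions() with np.vstack and run the scikit-learn step per partition. That is essentially the idea sparkit-learn's ArrayRDD wraps for you.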