Maybe this is helpful:
https://github.com/lensacom/sparkit-learn/blob/master/README.rst



-------- Original message --------
From: Mustafa Elbehery <elbeherymust...@gmail.com> 
Date: 12/06/2015  3:59 PM  (GMT-05:00) 
To: user <user@spark.apache.org> 
Subject: PySpark RDD with NumpyArray Structure 

Hi All, 
I would like to parallelize a Python NumPy array so I can apply a scikit-learn 
algorithm on top of Spark, but when I call sc.parallelize() I get an RDD with a 
different structure. 
To be more precise, I am trying to get the following: 
X = [[ 0.49426097  1.45106697]
 [-1.42808099 -0.83706377]
 [ 0.33855918  1.03875871]
 ..., 
 [-0.05713876 -0.90926105]
 [-1.16939407  0.03959692]
 [ 0.26322951 -0.92649949]]
However, what I get when I call sc.parallelize(X) is the following: 
[array([ 0.49426097,  1.45106697]), array([-1.42808099, -0.83706377])]
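That structure is actually expected: sc.parallelize() iterates over its argument, and iterating a 2-D NumPy array yields one 1-D row array at a time, so the RDD holds one row per element. A minimal NumPy-only sketch (no Spark required, values taken from the example above) shows the same decomposition and how the rows can be stacked back into a 2-D block, e.g. inside rdd.mapPartitions():

```python
import numpy as np

X = np.array([[ 0.49426097,  1.45106697],
              [-1.42808099, -0.83706377],
              [ 0.33855918,  1.03875871]])

# Iterating a 2-D array yields 1-D row arrays -- this is exactly what
# sc.parallelize(X) distributes, one row per RDD element.
rows = list(X)
print(rows[0])        # a 1-D array, as in the RDD output above

# The rows can be reassembled into a 2-D block with np.vstack; with Spark
# this would be done per partition, e.g.
#   rdd.mapPartitions(lambda it: [np.vstack(list(it))])
X_back = np.vstack(rows)
print(X_back.shape)   # same shape as the original X
```

This is only a sketch of the underlying behavior; the mapPartitions call shown in the comment assumes you want one 2-D block per partition to feed a scikit-learn estimator.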

Has anyone tried this before?
