There are two issues here:

1. Suppression of the true reason for failure. The Spark runtime reports
"TypeError", but that is not why the operation failed.

2. The low performance of loading a pandas DataFrame.


DISCUSSION

Number (1) is easily fixed, and is the primary purpose of my post.
Number (2) is harder, and may lead us to abandon Spark. To answer Akhil: the
process is too slow. Yes, it will work, but with large dense datasets, the
line

    data = [r.tolist() for r in data.to_records(index=False)]

is basically a brick wall. It will take longer to load the RDD than to do
all operations on it, by a large margin.
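For what it's worth, part of the cost comes from converting the frame one
record at a time: each element of `to_records()` is a NumPy scalar record
that must be unboxed individually. A single bulk `tolist()` on the underlying
array produces the same nested-list result in one call. This is only a sketch
of that difference (it assumes a purely numeric frame whose values form a
plain 2-D NumPy array; the array here is a stand-in, not our actual data):

```python
import numpy as np

# Stand-in for the dense numeric data (in our case, df.values from pandas).
data = np.arange(12.0).reshape(4, 3)

# Row-by-row conversion, analogous to the slow list comprehension above:
rows_slow = [r.tolist() for r in data]

# Single bulk conversion of the whole array in one C-level call:
rows_fast = data.tolist()

# Both produce the same list-of-lists, suitable for parallelize().
assert rows_slow == rows_fast
```

Whether the bulk form is fast enough for our dataset sizes is exactly the
open question; it avoids the per-row Python overhead but still materializes
the whole dataset as Python objects before the RDD is built.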

Any help or guidance (should we write some custom loader?) would be
appreciated.

FDS



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/IOError-on-createDataFrame-tp13888p13912.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
