There are two issues here:

1. Suppression of the true reason for failure. The Spark runtime reports "TypeError", but that is not why the operation failed.
2. The low performance of loading a pandas DataFrame.

DISCUSSION

Number (1) is easily fixed, and is the primary purpose of my post. Number (2) is harder, and may lead us to abandon Spark.

To answer Akhil: the process is too slow. Yes, it will work, but with large dense datasets, the line

    data = [r.tolist() for r in data.to_records(index=False)]

is basically a brick wall. It takes longer to load the RDD than to do all operations on it, by a large margin.

Any help or guidance (should we write a custom loader?) would be appreciated.

FDS

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/IOError-on-createDataFrame-tp13888p13912.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
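For concreteness, here is that conversion in isolation, a minimal sketch using a toy DataFrame standing in for the large dense dataset (the Spark createDataFrame call is omitted). The cost the post describes comes from materializing every row as a boxed Python tuple before Spark ever sees the data:

```python
import pandas as pd

# Toy stand-in for the large dense dataset.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# The bottleneck pattern from the post: to_records() builds a NumPy
# record array, then each record is converted to a Python tuple of
# Python objects, one row at a time.
data = [r.tolist() for r in df.to_records(index=False)]

print(data)  # [(1, 4.0), (2, 5.0), (3, 6.0)]
```

At three rows this is instant, but the per-row boxing is pure Python-level work, so it scales linearly with row count and dominates total runtime on large dense inputs, which is exactly the "brick wall" described above.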