Pandas performance is definitely the issue here. You're using Pandas as an
ETL system, and it's better suited as an endpoint than a conduit.
That is, it's great to dump your data there and do your analysis within
Pandas, subject to its constraints, but if you need to "back out" and use
some
There are two issues here:
1. Suppression of the true reason for failure. The Spark runtime reports
"TypeError", but that is not why the operation failed.
2. The low performance of loading a pandas dataframe into Spark.
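A minimal sketch of issue (2) and one common workaround. The file and column names here are illustrative, and the Spark calls (which vary by version) are shown only in comments:

```python
# Sketch: pushing a pandas frame into Spark through the driver is slow,
# because rows are pickled and shipped one at a time. Writing the frame
# to a file the executors can read in parallel is a common workaround.
import pandas as pd

pdf = pd.DataFrame({"id": range(1000), "value": [i * 0.5 for i in range(1000)]})

# Slow path described in this thread (not run here):
#   rdd = sc.parallelize(pdf.values)   # or sqlContext.createDataFrame(pdf)

# Workaround: back the data out of pandas to disk...
pdf.to_csv("frame.csv", index=False)

# ...and let Spark's executors load it directly (not run here; the exact
# reader API depends on your Spark version):
#   df = sqlContext.read.format("csv").load("frame.csv")
```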
DISCUSSION
Number (1) is easily fixed, and is the primary purpose of my post.
Number (2)
Why not attach a bigger hard disk to the machines and point your
SPARK_LOCAL_DIRS to it?
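For example, a minimal spark-env.sh fragment, assuming the larger disk is mounted at /mnt/bigdisk (the mount point is an assumption; SPARK_LOCAL_DIRS is Spark's standard setting for local scratch space):

```shell
# In conf/spark-env.sh on each worker node.
# /mnt/bigdisk is an assumed mount point for the larger disk;
# SPARK_LOCAL_DIRS controls where Spark spills shuffle/scratch data.
export SPARK_LOCAL_DIRS=/mnt/bigdisk/spark-local
```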
Thanks
Best Regards
On Sat, Aug 29, 2015 at 1:13 AM, fsacerdoti wrote:
> Hello,
>
> Similar to the thread below [1], when I tried to create an RDD from a 4GB
> pandas dataframe I encountered the error
>
>