Hi all,
Wes, I read your thread earlier today after I sent this message, and it's
exciting to have someone of your caliber working on the issue :)
For a short-term solution I've created a Gist which performs the toPandas
operation using the mapPartitions method suggested by Mark:
https://gist.github.com/jo
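For anyone following along, here is a minimal sketch of the general idea. It is
not the Gist itself; the function names to_pandas_by_partition and
_partition_to_pandas are made up for illustration, and it assumes the result
fits in driver memory once collected:

```python
import pandas as pd


def _partition_to_pandas(rows, columns):
    """Build one pandas DataFrame from the rows of a single partition."""
    # Spark Row objects are tuples, so from_records can consume them directly.
    yield pd.DataFrame.from_records(list(rows), columns=columns)


def to_pandas_by_partition(df):
    """Convert a Spark DataFrame to pandas, converting each partition separately.

    Hypothetical helper: `df` is assumed to be a Spark DataFrame exposing
    `.columns` and `.rdd`.
    """
    columns = df.columns
    # The rows-to-pandas conversion runs on the executors; the driver only
    # collects one pandas DataFrame per partition.
    parts = df.rdd.mapPartitions(
        lambda rows: _partition_to_pandas(rows, columns)
    ).collect()
    # Drop empty partitions so they do not affect the concatenated dtypes.
    parts = [part for part in parts if len(part) > 0]
    if not parts:
        return pd.DataFrame(columns=columns)
    return pd.concat(parts, ignore_index=True)
```

Usage would be something like pdf = to_pandas_by_partition(spark_df). The
driver still has to hold the full result, so this mainly moves the per-row
conversion work off the driver rather than reducing total memory.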
Hi all,
I recently did an analysis of the performance of toPandas.
Summary: http://wesmckinney.com/blog/pandas-and-apache-arrow/
IPython notebook: https://gist.github.com/wesm/0cb5531b1c2e346a0007
One solution I'm planning for this is an alternate serializer for
Spark DataFrames, with an optimize
Hi Josh,
The workaround we figured out to solve network latency and out-of-memory
problems with the toPandas method was to create pandas DataFrames or NumPy
arrays using mapPartitions for each partition. Maybe a standard solution along
this line of thought could be built. The integration is q