Re: toPandas very slow

2016-03-22 Thread Josh Levy-Kramer
Hi all, Wes, I read your thread earlier today after I sent this message, and it's exciting to have someone of your caliber working on the issue :) As a short-term solution, I've created a Gist that performs the toPandas operation using the mapPartitions method suggested by Mark: https://gist.github.com/jo

Re: toPandas very slow

2016-03-22 Thread Wes McKinney
Hi all, I recently did an analysis of the performance of toPandas. Summary: http://wesmckinney.com/blog/pandas-and-apache-arrow/ IPython notebook: https://gist.github.com/wesm/0cb5531b1c2e346a0007 One solution I'm planning for this is an alternate serializer for Spark DataFrames, with an optimize

Re: toPandas very slow

2016-03-22 Thread Mark Vervuurt
Hi Josh, The workaround we figured out to solve network latency and out-of-memory problems with the toPandas method was to create pandas DataFrames or NumPy arrays using mapPartitions for each partition. Maybe a standard solution along this line of thought could be built. The integration is q
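
For readers following the thread, here is a minimal sketch of the mapPartitions idea Mark describes. It is not the code from Josh's Gist or from Mark's team. The helper names partition_to_pandas and to_pandas_by_partition and the n_partitions parameter are hypothetical, chosen only for illustration. Each partition is turned into a pandas DataFrame on the executors, and the driver then concatenates the pieces.

```python
import pandas as pd


def partition_to_pandas(rows, columns):
    """Build one pandas DataFrame from the rows of a single partition.

    `rows` is any iterator of tuple-like records (a PySpark Row is a tuple
    subclass). Hypothetical helper, not part of PySpark.
    """
    data = [tuple(r) for r in rows]
    # mapPartitions expects an iterable back, so yield a single DataFrame.
    yield pd.DataFrame(data, columns=columns)


def to_pandas_by_partition(spark_df, n_partitions=None):
    """Collect a Spark DataFrame as pandas, converting per partition.

    Assumption: `spark_df` exposes `.columns`, `.repartition()` and
    `.rdd.mapPartitions()`, as a PySpark DataFrame does.
    """
    columns = list(spark_df.columns)
    if n_partitions is not None:
        spark_df = spark_df.repartition(n_partitions)
    parts = spark_df.rdd.mapPartitions(
        lambda rows: partition_to_pandas(rows, columns)
    ).collect()
    # Skip empty partitions so they do not affect the resulting dtypes.
    non_empty = [p for p in parts if not p.empty]
    if not non_empty:
        return pd.DataFrame(columns=columns)
    return pd.concat(non_empty, ignore_index=True)
```

With a live SparkSession this would be called as to_pandas_by_partition(df). Note that the full result still has to fit in driver memory; the sketch only moves the row-to-pandas conversion onto the executors, so whether it helps will depend on the data and cluster.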