Re: toPandas very slow

2016-03-22 Thread Josh Levy-Kramer
Hi all, Wes, I read your thread earlier today after I sent this message, and it's exciting to have someone of your caliber working on the issue :) As a short-term solution I've created a Gist which performs the toPandas operation using the mapPartitions method suggested by Mark: https://gist.github.com/jo

Re: toPandas very slow

2016-03-22 Thread Wes McKinney
Hi all, I recently did an analysis of the performance of toPandas. Summary: http://wesmckinney.com/blog/pandas-and-apache-arrow/ IPython notebook: https://gist.github.com/wesm/0cb5531b1c2e346a0007 One solution I'm planning for this is an alternate serializer for Spark DataFrames, with an optimize

Re: toPandas very slow

2016-03-22 Thread Mark Vervuurt
Hi Josh, The workaround we figured out to solve network latency and out-of-memory problems with the toPandas method was to create Pandas DataFrames or NumPy arrays using mapPartitions for each partition. Maybe a standard solution along this line of thought could be built. The integration is q
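
For readers following the thread, the per-partition idea described above might look roughly like the sketch below. This is an illustration only, not the code from Josh's Gist or from Mark's setup: the helper names `partition_to_pandas` and `to_pandas_via_partitions` are hypothetical, and it assumes a PySpark DataFrame whose partitions each fit in executor memory and whose combined result fits on the driver.

```python
import pandas as pd


def partition_to_pandas(rows, columns):
    """Build one pandas DataFrame from an iterator over a partition's rows.

    Hypothetical helper: `rows` is the iterator Spark passes to the
    function given to mapPartitions.
    """
    # Materialize only this partition's rows, never the whole dataset.
    records = [tuple(row) for row in rows]
    # mapPartitions expects an iterable back, so yield a single DataFrame.
    yield pd.DataFrame.from_records(records, columns=columns)


def to_pandas_via_partitions(spark_df):
    """Collect a Spark DataFrame as one pandas DataFrame, built per partition."""
    columns = spark_df.columns
    # Each executor converts its own partition, so the driver receives a
    # list of pandas DataFrames rather than a list of individual Row objects.
    parts = spark_df.rdd.mapPartitions(
        lambda rows: partition_to_pandas(rows, columns)
    ).collect()
    # Skip empty partitions before concatenating.
    parts = [part for part in parts if not part.empty]
    if not parts:
        return pd.DataFrame(columns=columns)
    return pd.concat(parts, ignore_index=True)
```

Under those assumptions it would be called as `pdf = to_pandas_via_partitions(df)` in place of `df.toPandas()`. Whether it is actually faster depends on the data and cluster, so it is worth timing both on a sample.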

toPandas very slow

2016-03-22 Thread Josh Levy-Kramer
Hi, A common pattern in my work is querying large tables in Spark DataFrames and then needing to do more detailed analysis locally when the data can fit into memory. However, I've hit a few blockers. In Scala no well-developed DataFrame library exists, and in Python the `toPandas` function is very