Hi Spark devs, I'm building streaming export functionality for RDDs and am running into trouble with large partitions. RDD.toLocalIterator() pulls one partition at a time over to the driver, then streams out that partition's elements before fetching the next one. With large partitions, though, the driver can OOM, especially when several of these exports are running in the same SparkContext.
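To make the memory behavior concrete, here's a toy plain-Python model (not PySpark, just an illustration) of the fetch pattern I'm describing: each "partition" is materialized in full on the driver before its elements are yielded, so peak driver memory tracks the largest single partition.

```python
# Toy model of the RDD.toLocalIterator() fetch pattern: each partition
# is pulled to the driver in its entirety before elements are yielded.
partitions = [list(range(0, 5)), list(range(5, 12)), list(range(12, 14))]

def to_local_iterator(parts):
    peak = 0  # largest single fetch -- a proxy for driver memory pressure
    for part in parts:
        fetched = list(part)  # whole partition materialized at once
        peak = max(peak, len(fetched))
        yield from fetched
    # peak here is the size of the biggest partition (7 in this example),
    # which is why one oversized partition can OOM the driver.
    print(f"largest single fetch: {peak} elements")

result = list(to_local_iterator(partitions))
```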
One idea I had was to repartition the RDD so the partitions are smaller, but it's hard to know a priori what the partition count should be, and I'd like to avoid paying the shuffle cost if possible -- increasing the partition count via repartition forces a shuffle. Would it be feasible to rework this so the executor -> driver transfer in .toLocalIterator is a steady stream rather than one whole partition at a time? Thanks! Andrew
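For what it's worth, the driver-side shape of the "steady stream" I have in mind looks roughly like this (again a plain-Python sketch, not a real implementation -- in actual Spark the executor-side iterator state would have to survive across fetches, so some paging protocol between driver and executors would be needed). The point is that the driver never holds more than a fixed-size batch from any one partition:

```python
from itertools import islice

def batched_local_iterator(parts, batch_size):
    """Sketch: stream each partition in fixed-size batches so the driver
    holds at most batch_size elements from a single fetch, regardless of
    how large any individual partition is."""
    for part in parts:
        it = iter(part)
        while True:
            batch = list(islice(it, batch_size))  # bounded fetch
            if not batch:
                break  # this partition is exhausted; move to the next
            yield from batch

# Same total output as partition-at-a-time, but bounded driver memory.
out = list(batched_local_iterator([[1, 2, 3], [4, 5]], batch_size=2))
```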