Hi Spark devs, I'm building streaming export functionality for RDDs and am running into trouble with large partitions. RDD.toLocalIterator() pulls one partition at a time over to the driver, then streams out that partition's elements before fetching the next one. With large partitions, though, the driver can OOM, especially when several of these exports are running in the same SparkContext.
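To make the memory behavior concrete, here's a toy plain-Python model (not PySpark, just an illustration) of the fetch pattern I'm describing: each "partition" is materialized in full on the driver before its elements are yielded, so peak driver memory tracks the largest single partition.

```python
# Toy model of the RDD.toLocalIterator() fetch pattern: each partition
# is pulled to the driver in its entirety before elements are yielded.
partitions = [list(range(0, 5)), list(range(5, 12)), list(range(12, 14))]

def to_local_iterator(parts):
    peak = 0  # largest single fetch -- a proxy for driver memory pressure
    for part in parts:
        fetched = list(part)  # whole partition materialized at once
        peak = max(peak, len(fetched))
        yield from fetched
    # peak here is the size of the biggest partition (7 in this example),
    # which is why one oversized partition can OOM the driver.
    print(f"largest single fetch: {peak} elements")

result = list(to_local_iterator(partitions))
```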
One idea I had was to repartition the RDD so the partitions are smaller, but it's hard to know a priori what the partition count should be, and I'd like to avoid paying the shuffle cost if possible -- increasing the partition count via repartition forces a shuffle. Would it be feasible to rework this so the executor -> driver transfer in .toLocalIterator is a steady stream rather than one whole partition at a time? Thanks! Andrew
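For what it's worth, the driver-side shape of the "steady stream" I have in mind looks roughly like this (again a plain-Python sketch, not a real implementation -- in actual Spark the executor-side iterator state would have to survive across fetches, so some paging protocol between driver and executors would be needed). The point is that the driver never holds more than a fixed-size batch from any one partition:

```python
from itertools import islice

def batched_local_iterator(parts, batch_size):
    """Sketch: stream each partition in fixed-size batches so the driver
    holds at most batch_size elements from a single fetch, regardless of
    how large any individual partition is."""
    for part in parts:
        it = iter(part)
        while True:
            batch = list(islice(it, batch_size))  # bounded fetch
            if not batch:
                break  # this partition is exhausted; move to the next
            yield from batch

# Same total output as partition-at-a-time, but bounded driver memory.
out = list(batched_local_iterator([[1, 2, 3], [4, 5]], batch_size=2))
```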