I think a cheap way to repartition to a higher partition count without
shuffle would be valuable too. Right now you can choose whether to execute
a shuffle when going down in partition count, but going up in partition
count always requires a shuffle. For the need of having a smaller
partitions to
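For context, this is the asymmetry in the current RDD API: coalesce can reduce the partition count without a shuffle, but increasing it goes through repartition, which is just coalesce with shuffle = true. A minimal local-mode sketch:

import org.apache.spark.{SparkConf, SparkContext}

object RepartitionAsymmetry {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("repartition-asymmetry").setMaster("local[4]"))
    val rdd = sc.parallelize(1 to 1000, numSlices = 8)

    // Going down: partitions are merged locally, no shuffle required.
    val fewer = rdd.coalesce(2, shuffle = false)

    // Going up: repartition is coalesce(n, shuffle = true) under the hood,
    // so raising the partition count always pays for a full shuffle.
    val more = rdd.repartition(32)

    println(s"fewer = ${fewer.getNumPartitions}, more = ${more.getNumPartitions}")
    sc.stop()
  }
}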
Another alternative would be to compress the partition in memory in a
streaming fashion instead of calling .toArray on the iterator. Would it be
an easier mitigation for the problem? Or is it hard to compress the rows one by one, using the compression codec, without materializing the full partition in memory?
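To make the question concrete, here is a rough sketch of what compressing a partition row by row could look like, assuming plain Java serialization and GZIP rather than whatever codec Spark would actually use (compressPartition is a hypothetical helper, not an existing API):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.util.zip.GZIPOutputStream

object StreamingCompress {
  // Hypothetical sketch: serialize and compress rows as they are pulled from
  // the iterator, so only one row is live at a time instead of the whole
  // partition (as iter.toArray would require). The compressed bytes still
  // accumulate in memory, but that footprint is much smaller.
  def compressPartition[T](iter: Iterator[T]): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(new GZIPOutputStream(buffer))
    try {
      iter.foreach(row => out.writeObject(row))
    } finally {
      out.close() // finishes the GZIP stream and flushes the trailer
    }
    buffer.toByteArray
  }
}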
This would be pretty tricky to do -- the issue is that right now
sparkContext.runJob has you pass in a function from a partition to *one*
result object that gets serialized and sent back: Iterator[T] => U, and
that idea is baked pretty deep into a lot of the internals: DAGScheduler, Task, Executors, and so on.
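For reference, the shape being described is the two-argument runJob overload on SparkContext, where each partition's iterator is collapsed into a single result object that is sent back to the driver. A small local-mode example:

import org.apache.spark.{SparkConf, SparkContext}

object RunJobShape {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("runjob-shape").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 100, numSlices = 4)

    // Each partition's Iterator[Int] is reduced to *one* Int on the executor,
    // which is serialized and shipped back to the driver -- one result per
    // partition, i.e. func: Iterator[T] => U.
    val perPartitionSums: Array[Int] =
      sc.runJob(rdd, (iter: Iterator[Int]) => iter.sum)

    println(perPartitionSums.mkString(", "))
    sc.stop()
  }
}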