I think a cheap way to repartition to a higher partition count without
shuffle would be valuable too. Right now you can choose whether to execute
a shuffle when going down in partition count, but going up in partition
count always requires a shuffle. For the need of having a smaller
partitions to
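For context, this is the asymmetry in the current RDD API: coalesce can reduce the partition count without a shuffle, but increasing it goes through repartition, which is just coalesce with shuffle = true. A minimal local-mode sketch:

import org.apache.spark.{SparkConf, SparkContext}

object RepartitionAsymmetry {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("repartition-asymmetry").setMaster("local[4]"))
    val rdd = sc.parallelize(1 to 1000, numSlices = 8)

    // Going down: partitions are merged locally, no shuffle required.
    val fewer = rdd.coalesce(2, shuffle = false)

    // Going up: repartition is coalesce(n, shuffle = true) under the hood,
    // so raising the partition count always pays for a full shuffle.
    val more = rdd.repartition(32)

    println(s"fewer = ${fewer.getNumPartitions}, more = ${more.getNumPartitions}")
    sc.stop()
  }
}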
Another alternative would be to compress the partition in memory in a
streaming fashion instead of calling .toArray on the iterator. Would it be
an easier mitigation for the problem? Or is it hard to compress the rows one by one, using the compression codec, without materializing the full partition in memory?
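To make the question concrete, here is a rough sketch of what compressing a partition row by row could look like, assuming plain Java serialization and GZIP rather than whatever codec Spark would actually use (compressPartition is a hypothetical helper, not an existing API):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.util.zip.GZIPOutputStream

object StreamingCompress {
  // Hypothetical sketch: serialize and compress rows as they are pulled from
  // the iterator, so only one row is live at a time instead of the whole
  // partition (as iter.toArray would require). The compressed bytes still
  // accumulate in memory, but that footprint is much smaller.
  def compressPartition[T](iter: Iterator[T]): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(new GZIPOutputStream(buffer))
    try {
      iter.foreach(row => out.writeObject(row))
    } finally {
      out.close() // finishes the GZIP stream and flushes the trailer
    }
    buffer.toByteArray
  }
}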
This would be pretty tricky to do -- the issue is that right now
sparkContext.runJob has you pass in a function from a partition to *one*
result object that gets serialized and sent back: Iterator[T] => U, and
that idea is baked pretty deep into a lot of the internals: DAGScheduler, Task, Executors, and so on.
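For reference, the shape being described is the two-argument runJob overload on SparkContext, where each partition's iterator is collapsed into a single result object that is sent back to the driver. A small local-mode example:

import org.apache.spark.{SparkConf, SparkContext}

object RunJobShape {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("runjob-shape").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 100, numSlices = 4)

    // Each partition's Iterator[Int] is reduced to *one* Int on the executor,
    // which is serialized and shipped back to the driver -- one result per
    // partition, i.e. func: Iterator[T] => U.
    val perPartitionSums: Array[Int] =
      sc.runJob(rdd, (iter: Iterator[Int]) => iter.sum)

    println(perPartitionSums.mkString(", "))
    sc.stop()
  }
}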