Yiannis,

It looks like you might explore another approach.

sc.textFile("input/path")
  .map(...)                              // your own implementation: produce (key, value) pairs
  .partitionBy(new HashPartitioner(num))
  .groupByKey()                          // result: a pair RDD of key -> Iterable of values
  .foreachPartition(...)

In the last step you could sort all the values for a key and store them in a separate file, even in the same directory as the files for all the other keys. HashPartitioner guarantees that all values for a specific key end up in exactly one partition, but one partition may contain more than one key (with its values). I'm not sure about that, but it shouldn't be a big deal, as you would iterate over each tuple<key, Iterable<value>> and write each key to its own file.
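To make the idea concrete, here is a minimal sketch of the whole pipeline. The "key,value" input format, the parsing logic, the output directory "out/", and the partition count are all my assumptions, so adapt them to your data:

import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}
import java.io.PrintWriter

val conf = new SparkConf().setAppName("files-per-key").setMaster("local[*]")
val sc = new SparkContext(conf)

sc.textFile("input/path")
  .map { line =>
    val Array(k, v) = line.split(",", 2)  // assumed "key,value" lines; swap in your own parsing
    (k, v)
  }
  .partitionBy(new HashPartitioner(8))    // 8 partitions, chosen arbitrarily here
  .groupByKey()                           // (key, Iterable[value]) per key
  .foreachPartition { part =>
    part.foreach { case (key, values) =>
      // one file per key; sort the values before writing
      val writer = new PrintWriter(s"out/$key.txt")
      try values.toSeq.sorted.foreach(writer.println)
      finally writer.close()
    }
  }

sc.stop()

Note that foreachPartition runs on the executors, so writing with plain PrintWriter like this only lands in one place in local mode; on a real cluster you would write to a shared filesystem or HDFS instead.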

On 15 Jul 2015, at 03:23, Yiannis Gkoufas <johngou...@gmail.com> wrote:

> Hi there,
> 
> I have been using the approach described here:
> 
> http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job
> 
> In addition to that, I was wondering if there is a way to customize the 
> order of those values contained in each file.
> 
> Thanks a lot!

Eugene Morozov
fathers...@list.ru
