On 04/27/2015 06:00 PM, Ganelin, Ilya wrote:
> Marco - why do you want data sorted both within and across partitions? If you 
> need to take an ordered sequence across all your data you need to either 
> aggregate your RDD on the driver and sort it, or use zipWithIndex to apply an 
> ordered index to your data that matches the order it was stored on HDFS. You 
> can then get the data in order by filtering based on that index. Let me know 
> if that's not what you need - thanks!
> 

Basically, after a mapping d -> (k,v), I have to aggregate my data grouped
by key, and I also want the output of this aggregation to be sorted. One
way to do that would be something like

flatMapToPair(myMapFunc)
    .reduceByKey(RangePartitioner, myReduceFunc)
    .mapPartitions(i -> sort(i))
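
To make the result I'm after concrete, here is a minimal plain-Java model of
it. It does not use Spark at all: the class and method names are made up for
illustration, the reduce function is assumed to be a sum, and the key ranges
are cut evenly from the full sorted key set (a real RangePartitioner picks its
bounds by sampling, so its partitions are only roughly balanced).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of the desired result; it does not use Spark at all.
// The class and method names are hypothetical and exist only for illustration.
final class SortedReduceModel {

    // Convenience factory for a (key, value) pair.
    static Map.Entry<String, Integer> pair(String key, Integer value) {
        return new AbstractMap.SimpleImmutableEntry<>(key, value);
    }

    // Models reduceByKey with a sum as the (assumed) reduce function.
    // A TreeMap keeps the reduced keys in their natural order.
    static TreeMap<String, Integer> reduceByKey(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, Integer> reduced = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            reduced.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return reduced;
    }

    // Models range partitioning plus sorting within each partition:
    // consecutive partitions receive contiguous, ascending key ranges,
    // so reading the partitions in order yields a globally sorted sequence.
    static List<List<Map.Entry<String, Integer>>> rangePartition(
            TreeMap<String, Integer> reduced, int numPartitions) {
        List<List<Map.Entry<String, Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        int n = reduced.size();
        int index = 0;
        for (Map.Entry<String, Integer> e : reduced.entrySet()) {
            // Even split of the sorted keys; the loop body never runs when n == 0.
            int target = (int) ((long) index * numPartitions / n);
            partitions.get(target).add(pair(e.getKey(), e.getValue()));
            index++;
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = Arrays.asList(
                pair("b", 1), pair("a", 2), pair("c", 3), pair("b", 4), pair("d", 5));
        System.out.println(rangePartition(reduceByKey(pairs), 2));
        // prints [[a=2, b=5], [c=3, d=5]]
    }
}
```

Every key appears once with its aggregated value, each partition is sorted,
and all keys in one partition precede all keys in the next.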

But I was thinking that the sorting could be pushed down into the
shuffle phase, as is done in sortByKey and
repartitionAndSortWithinPartitions, by calling setKeyOrdering on the
ShuffledRDD returned by reduceByKey (or combineByKey).


Am I wrong?

I'm not a Scala programmer. Is there an easy way to do that with the
current Java APIs? If not, what is the quickest way to do it in Scala?

Also, a more trivial question: I can't figure out how to use
RangePartitioner from Java, because I don't understand what to pass for
the Ordering and ClassTag constructor parameters. Where can I find some
reference or examples?

Thank you all,
Marco

> 
> -----Original Message-----
> From: Marco [marcope...@gmail.com]
> Sent: Monday, April 27, 2015 07:01 AM Eastern Standard Time
> To: user@spark.apache.org
> Subject: ReduceByKey and sorting within partitions
> 
> 
> Hi,
> 
> I'm trying, after reducing by key, to get data ordered across partitions
> (like RangePartitioner) and within partitions (like sortByKey or
> repartitionAndSortWithinPartitions), pushing the sorting down into the
> shuffle machinery of the reduce phase.
> 
> I think, but maybe I'm wrong, that the correct way to do that is for
> combineByKey to call setKeyOrdering on the ShuffledRDD that it returns.
> 
> Am I wrong? Can it be done by a combination of other transformations with
> the same efficiency?
> 
> Thanks,
> Marco
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
