How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Aung Htet
Hi all, I have a distribution represented as an RDD of tuples, in rows of (segment, score). For each segment, I want to discard the tuples with the top X percent of scores. This seems hard to do with a Spark RDD. A naive algorithm would be: 1) Sort the RDD by segment and score (descending). 2) Within each segment, nu
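The naive approach described above can be illustrated outside Spark. The following is only a minimal local sketch in plain Python, not the poster's code: the function name `drop_top_percent`, the sample rows, and the rounding rule (drop `floor(n * X / 100)` rows per segment) are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def drop_top_percent(rows, percent):
    """Discard the top `percent` of scores within each segment.

    `rows` is an iterable of (segment, score) tuples. The number of rows
    dropped per segment is floor(n * percent / 100), an assumed rounding rule.
    """
    # Group scores by segment (in Spark this grouping implies a shuffle).
    by_segment = defaultdict(list)
    for segment, score in rows:
        by_segment[segment].append(score)

    kept = []
    for segment in sorted(by_segment):
        # Sort descending so the highest scores come first.
        scores = sorted(by_segment[segment], reverse=True)
        n_drop = math.floor(len(scores) * percent / 100.0)
        # Skip the first n_drop (highest) scores and keep the rest.
        kept.extend((segment, s) for s in scores[n_drop:])
    return kept

# Hypothetical sample data.
rows = [("a", 1), ("a", 5), ("a", 3), ("a", 9), ("b", 2), ("b", 8)]
result = drop_top_percent(rows, 50)
```

With the sample data, segment "a" loses its two highest scores and segment "b" loses its highest one. The output is ordered by segment and then by descending score, not by input order.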

Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Aung Htet
te...
>
> But this is only good for top 10% or bottom 10%... if you need to do it for
> top 30% then maybe the shuffle version will work better...
>
> On Thu, Mar 26, 2015 at 8:31 PM, Aung Htet wrote:
>
>> Hi all,
>>
>> I have a distribution represented as an

Re: How to get a top X percent of a distribution represented as RDD

2015-04-03 Thread Aung Htet
ata structure such as in
>> https://github.com/laserson/dsq
>>
>> to get approximate quantiles, then use whatever values you want to filter
>> the original sequence.
>> --
>> *From:* Debasish Das
>> *Sent:* Thursday, March 26, 2015 9:
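The quantile-then-filter idea quoted above can be sketched in two passes: first derive a cutoff score per segment, then filter the original rows against it. The sketch below is plain Python and computes exact cutoffs from sorted scores as a stand-in for an approximate quantile structure; it does not use the dsq library, and the names `segment_thresholds` and `filter_by_threshold`, the sample rows, and the tie handling (scores equal to the cutoff are kept) are assumptions.

```python
def segment_thresholds(rows, percent):
    """Return {segment: highest score to keep} after dropping the top `percent`.

    Exact computation, used here in place of an approximate quantile sketch.
    """
    by_segment = {}
    for segment, score in rows:
        by_segment.setdefault(segment, []).append(score)

    thresholds = {}
    for segment, scores in by_segment.items():
        scores.sort()  # ascending
        n_keep = len(scores) - int(len(scores) * percent / 100.0)
        # None means every row in the segment would be dropped.
        thresholds[segment] = scores[n_keep - 1] if n_keep > 0 else None
    return thresholds

def filter_by_threshold(rows, thresholds):
    """Keep rows whose score does not exceed the segment's cutoff."""
    return [
        (segment, score)
        for segment, score in rows
        if thresholds[segment] is not None and score <= thresholds[segment]
    ]

# Hypothetical sample data.
rows = [("a", 1), ("a", 5), ("a", 3), ("a", 9), ("b", 2), ("b", 8)]
thresholds = segment_thresholds(rows, 25)
filtered = filter_by_threshold(rows, thresholds)
```

Because the filter is a simple per-row comparison, the second pass preserves input order and needs only the small table of cutoffs. With tied scores at the cutoff, this version keeps all of them, so slightly fewer than X percent may be removed.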