Hi all,
I have a distribution represented as an RDD of tuples, in rows of (segment,
score).
For each segment, I want to discard the tuples whose scores fall in the top X
percent. This seems hard to do with a Spark RDD.
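To make the goal concrete, here is a minimal sketch in plain Python (not Spark) of the per-segment result I am after. The function name `drop_top_percent`, the sample rows, and the choice to round the cut size down are my own assumptions for illustration only.

```python
from collections import defaultdict


def drop_top_percent(rows, percent):
    """Return rows with the top `percent` of scores removed in each segment."""
    # Collect the scores of each segment.
    by_segment = defaultdict(list)
    for segment, score in rows:
        by_segment[segment].append(score)

    kept = []
    for segment, scores in by_segment.items():
        # Highest scores first, so the ones to discard are at the front.
        scores.sort(reverse=True)
        # Assumption: round the cut size down, so very small segments lose nothing.
        n_drop = int(len(scores) * percent // 100)
        kept.extend((segment, s) for s in scores[n_drop:])
    return kept


# Hypothetical sample data: segment "a" has 10 rows, segment "b" has 2.
rows = [("a", s) for s in range(1, 11)] + [("b", 50), ("b", 40)]
result = drop_top_percent(rows, 20)
```

With X = 20, segment "a" loses its two highest scores and segment "b" is too small to lose any. This version holds every segment in memory, which is exactly what may not scale on a large RDD.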
A naive algorithm would be:
1) Sort RDD by segment & score (descending)
2) Within each segment, nu
te...
>
> But this is only good for the top 10% or bottom 10%. If you need to do it for
> the top 30%, then maybe the shuffle version will work better.
>
> On Thu, Mar 26, 2015 at 8:31 PM, Aung Htet wrote:
>
>> Hi all,
>>
>> I have a distribution represented as an
ata structure such as in
>> https://github.com/laserson/dsq
>>
>> to get approximate quantiles, then use whatever values you want to filter
>> the original sequence.
>> --
>> *From:* Debasish Das
>> *Sent:* Thursday, March 26, 2015 9: