Ah.. might actually. I'll have to mess around with that.

On Sat, Sep 10, 2016 at 6:06 PM, Karl Higley <kmhig...@gmail.com> wrote:

> Would `topByKey` help?
>
> https://github.com/apache/spark/blob/master/mllib/src/
> main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42
>
> Best,
> Karl
>
> On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton <bur...@spinn3r.com> wrote:
>
>> I'm trying to figure out a way to group by and return the top 100 records
>> in that group.
>>
>> Something like:
>>
>> SELECT TOP(100, user_id) FROM posts GROUP BY user_id;
>>
>> But I can't really figure out the best way to do this...
>>
>> There is a FIRST and LAST aggregate function but this only returns one
>> column.
>>
>> I could do something like:
>>
>> SELECT * FROM posts WHERE user_id IN ( /* select top users here */ )
>> LIMIT 100;
>>
>> But that limit is applied for ALL the records. Not each individual user.
>>
>> The only other thing I can think of is to do a manual map reduce and then
>> have the reducer only return the top 100 each time...
>>
>> Would LOVE some advice here...
>>
>> --
>>
>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>> Engineers!
>>
>> Founder/CEO Spinn3r.com
>> Location: *San Francisco, CA*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> <https://plus.google.com/102718274791889610666/posts>
>>
>>


-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>

Reply via email to