Looks like you can do it with dense_rank functions. https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
I setup some basic records and seems like it did the right thing. Now time to throw 50TB and 100 spark nodes at this problem and see what happens :) On Sat, Sep 10, 2016 at 7:42 PM, Kevin Burton <bur...@spinn3r.com> wrote: > Ah.. might actually. I'll have to mess around with that. > > On Sat, Sep 10, 2016 at 6:06 PM, Karl Higley <kmhig...@gmail.com> wrote: > >> Would `topByKey` help? >> >> https://github.com/apache/spark/blob/master/mllib/src/main/ >> scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42 >> >> Best, >> Karl >> >> On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton <bur...@spinn3r.com> wrote: >> >>> I'm trying to figure out a way to group by and return the top 100 >>> records in that group. >>> >>> Something like: >>> >>> SELECT TOP(100, user_id) FROM posts GROUP BY user_id; >>> >>> But I can't really figure out the best way to do this... >>> >>> There is a FIRST and LAST aggregate function but this only returns one >>> column. >>> >>> I could do something like: >>> >>> SELECT * FROM posts WHERE user_id IN ( /* select top users here */ ) >>> LIMIT 100; >>> >>> But that limit is applied for ALL the records. Not each individual user. >>> >>> >>> The only other thing I can think of is to do a manual map reduce and >>> then have the reducer only return the top 100 each time... >>> >>> Would LOVE some advice here... >>> >>> -- >>> >>> We’re hiring if you know of any awesome Java Devops or Linux Operations >>> Engineers! >>> >>> Founder/CEO Spinn3r.com >>> Location: *San Francisco, CA* >>> blog: http://burtonator.wordpress.com >>> … or check out my Google+ profile >>> <https://plus.google.com/102718274791889610666/posts> >>> >>> > > > -- > > We’re hiring if you know of any awesome Java Devops or Linux Operations > Engineers! > > Founder/CEO Spinn3r.com > Location: *San Francisco, CA* > blog: http://burtonator.wordpress.com > … or check out my Google+ profile > <https://plus.google.com/102718274791889610666/posts> > > -- We’re hiring if you know of any awesome Java Devops or Linux Operations Engineers! Founder/CEO Spinn3r.com Location: *San Francisco, CA* blog: http://burtonator.wordpress.com … or check out my Google+ profile <https://plus.google.com/102718274791889610666/posts>