Ah... it might, actually. I'll have to mess around with that.

On Sat, Sep 10, 2016 at 6:06 PM, Karl Higley <kmhig...@gmail.com> wrote:
> Would `topByKey` help?
>
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42
>
> Best,
> Karl
>
> On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton <bur...@spinn3r.com> wrote:
>
>> I'm trying to figure out a way to group by and return the top 100 records
>> in that group.
>>
>> Something like:
>>
>> SELECT TOP(100, user_id) FROM posts GROUP BY user_id;
>>
>> But I can't really figure out the best way to do this...
>>
>> There is a FIRST and LAST aggregate function but this only returns one
>> column.
>>
>> I could do something like:
>>
>> SELECT * FROM posts WHERE user_id IN ( /* select top users here */ )
>> LIMIT 100;
>>
>> But that limit is applied for ALL the records. Not each individual user.
>>
>> The only other thing I can think of is to do a manual map reduce and then
>> have the reducer only return the top 100 each time...
>>
>> Would LOVE some advice here...
>>
>> --
>>
>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>> Engineers!
>>
>> Founder/CEO Spinn3r.com
>> Location: *San Francisco, CA*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> <https://plus.google.com/102718274791889610666/posts>
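
For what it's worth, the "reducer only returns the top 100" idea from the original question can be sketched outside Spark. The following is a minimal plain-Python illustration, not Spark code and not the `topByKey` implementation: it keeps a small bounded heap per key, so each group never holds more than `n` records. The function name `top_n_per_key`, the `score` field, and the sample rows are all hypothetical; the thread does not say what the posts should be ranked by.

```python
import heapq
from collections import defaultdict


def top_n_per_key(records, n, key, rank):
    """Return {group: up to n records with the largest rank, best first}.

    `key` picks the grouping value (e.g. user_id); `rank` picks the value
    to order by. Both are caller-supplied functions (assumptions here).
    """
    heaps = defaultdict(list)  # group -> min-heap of (rank, -seq, record)
    for seq, record in enumerate(records):
        # -seq is unique, so tuple comparison never reaches the record
        # itself, and earlier records win when ranks tie.
        entry = (rank(record), -seq, record)
        heap = heaps[key(record)]
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry[:2] > heap[0][:2]:
            # New record beats the weakest one kept so far: swap it in.
            heapq.heapreplace(heap, entry)
    return {
        group: [rec for _, _, rec in sorted(heap, key=lambda e: e[:2], reverse=True)]
        for group, heap in heaps.items()
    }


# Hypothetical sample data: a "score" column stands in for whatever
# defines "top" in the real posts table.
posts = [
    {"user_id": "a", "score": 5},
    {"user_id": "a", "score": 9},
    {"user_id": "a", "score": 1},
    {"user_id": "b", "score": 3},
]
top = top_n_per_key(posts, 2, key=lambda p: p["user_id"], rank=lambda p: p["score"])
```

The per-key heap is what keeps memory bounded: a group with millions of posts still only retains `n` of them at any time, which is the property you would want from the reducer in the manual map/reduce version.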