Yeah, unfortunately SparkSQL is missing a lot of the nice analytical functions in Hive. But using a combo of SQL and Spark operations you should be able to run the basic SQL, then do a groupBy on the SchemaRDD, then for each group just take the first record.
From: Fengyun RAO <[email protected]<mailto:[email protected]>> Date: Thursday, August 21, 2014 at 8:26 AM To: "[email protected]<mailto:[email protected]>" <[email protected]<mailto:[email protected]>> Subject: Re: [Spark SQL] How to select first row in each GROUP BY group? Could anybody help? I googled and read a lot, but didn’t find anything helpful. or to make the question simple: How to set row number for each group? SELECT a, ROW_NUMBER() OVER (PARTITION BY a) AS num FROM table. 2014-08-20 15:52 GMT+08:00 Fengyun RAO <[email protected]<mailto:[email protected]>>: I have a table with 4 columns: a, b, c, time What I need is something like: SELECT a, b, GroupFirst(c) FROM t GROUP BY a, b GroupFirst means "the first" item of column c group, and by "the first" I mean minimal "time" in that group. In Oracle/Sql Server, we could write: WITH summary AS ( SELECT a, b, c, ROW_NUMBER() OVER(PARTITION BY a, b ORDER BY time) AS num FROM t)SELECT s.*FROM summary s WHERE s.num = 1 but in Spark SQL, there is no such thing as ROW_NUMBER() I wonder how to make it.
