Thanks, Silvio,
If we write
schemaRDD.map(row => (key, row))
  .groupByKey()
  .map { case (key, rows) => rows.head } // take the first row from Iterable[Row]
we get an RDD[Row]. However, we need a SchemaRDD for the following query.
In our case, each Row has about 80 columns, which exceeds the 22-field
case class limit in Scala 2.10.
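
To make the intent concrete, here is a minimal sketch in plain Scala collections (no Spark involved). It assumes rows are positional sequences with the columns in the order a, b, c, time, and that time is a Long. The names `FirstRowPerGroup`, `Row` (a local type alias, not Spark's class) and `firstPerGroup` are hypothetical and only for illustration.

```scala
object FirstRowPerGroup {
  // Hypothetical stand-in for a Spark SQL Row: a positional sequence of values,
  // so no case class (and no field-count limit) is needed.
  type Row = Seq[Any]

  // Assumed column positions: a = 0, b = 1, c = 2, time = 3 (time stored as Long).
  def firstPerGroup(rows: Seq[Row]): Seq[Row] =
    rows
      .groupBy(r => (r(0), r(1)))                          // GROUP BY a, b
      .values
      .map(group => group.minBy(r => r(3).asInstanceOf[Long])) // keep the row with minimal time
      .toSeq

  def main(args: Array[String]): Unit = {
    val rows: Seq[Row] = Seq(
      Seq("x", 1, "late", 30L),
      Seq("x", 1, "early", 10L),
      Seq("y", 2, "only", 20L)
    )
    // Prints one row per (a, b) group; the order of groups is not guaranteed.
    firstPerGroup(rows).foreach(println)
  }
}
```

On an RDD the same shape would probably be a map to (key, row) pairs followed by reduceByKey keeping the row with the smaller time, which avoids building the whole Iterable per group. Turning the resulting RDD[Row] back into a SchemaRDD is the part this sketch does not cover.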
2014-08-21 21:05 GMT+08:00 Silvio Fiorito <[email protected]>:
> Yeah, unfortunately SparkSQL is missing a lot of the nice analytical
> functions in Hive. But using a combo of SQL and Spark operations you should
> be able to run the basic SQL, then do a groupBy on the SchemaRDD, then for
> each group just take the first record.
>
> From: Fengyun RAO <[email protected]>
> Date: Thursday, August 21, 2014 at 8:26 AM
> To: "[email protected]" <[email protected]>
> Subject: Re: [Spark SQL] How to select first row in each GROUP BY group?
>
> Could anybody help? I googled and read a lot, but didn’t find anything
> helpful.
>
> Or, to put the question more simply:
>
> *How to set a row number for each group?*
>
> SELECT a,
>        ROW_NUMBER() OVER (PARTITION BY a) AS num
> FROM table
>
> 2014-08-20 15:52 GMT+08:00 Fengyun RAO <[email protected]>:
>
> I have a table with 4 columns: a, b, c, time
>>
>> What I need is something like:
>>
>> SELECT a, b, GroupFirst(c)
>> FROM t
>> GROUP BY a, b
>>
>> GroupFirst means "the first" value of column c within each group,
>> where "the first" is the one with the minimal "time" in that group.
>>
>>
>> In Oracle/Sql Server, we could write:
>>
>> WITH summary AS (
>>   SELECT a,
>>          b,
>>          c,
>>          ROW_NUMBER() OVER (PARTITION BY a, b ORDER BY time) AS num
>>   FROM t)
>> SELECT s.*
>> FROM summary s
>> WHERE s.num = 1
>>
>> But Spark SQL has no ROW_NUMBER().
>>
>> I wonder how to achieve this.
>>
>>
>>
>