Re: [Spark SQL] How to select first row in each GROUP BY group?

Fengyun RAO Thu, 21 Aug 2014 20:27:24 -0700

Thanks, Silvio,

If we write


schemaRDD.map(row => (key, row))
         .groupBy(key)
         .map((key, rows) => row)  // take the first row from Iterable[ROW]

We get an RDD[ROW], however, we need a SchemaRDD for following query.

In our case, the ROW has about 80 columns which exceeds the case class
limit.



2014-08-21 21:05 GMT+08:00 Silvio Fiorito <[email protected]>:

>  Yeah, unfortunately SparkSQL is missing a lot of the nice analytical
> functions in Hive. But using a combo of SQL and Spark operations you should
> be able to run the basic SQL, then do a groupBy on the SchemaRDD, then for
> each group just take the first record.
>
>   From: Fengyun RAO <[email protected]>
> Date: Thursday, August 21, 2014 at 8:26 AM
> To: "[email protected]" <[email protected]>
> Subject: Re: [Spark SQL] How to select first row in each GROUP BY group?
>
>    Could anybody help? I googled and read a lot, but didn’t find anything
> helpful.
>
> or to make the question simple:
>
> *How to set row number for each group? *
>
> SELECT a,
>        ROW_NUMBER() OVER (PARTITION BY a) AS num FROM table.
>
> 2014-08-20 15:52 GMT+08:00 Fengyun RAO <[email protected]>:
>
>   I have a table with 4 columns: a, b, c, time
>>
>>  What I need is something like:
>>
>>  SELECT a, b, GroupFirst(c)
>> FROM t
>> GROUP BY a, b
>>
>>  GroupFirst means "the first" item of column c group,
>> and by "the first" I mean minimal "time" in that group.
>>
>>
>>  In Oracle/Sql Server, we could write:
>>
>> WITH summary AS (
>>     SELECT a,
>>            b,            c,
>>            ROW_NUMBER() OVER(PARTITION BY a, b ORDER BY time) AS
>>  num
>>     FROM t)SELECT s.*FROM summary sWHERE s.num = 1
>>
>>  but in Spark SQL, there is no such thing as ROW_NUMBER()
>>
>>  I wonder how to make it.
>>
>>
>>     
>

Re: [Spark SQL] How to select first row in each GROUP BY group?

Reply via email to