Re: [Spark SQL] How to select first row in each GROUP BY group?

Silvio Fiorito Thu, 21 Aug 2014 06:07:10 -0700

Yeah, unfortunately Spark SQL is missing a lot of the nice analytical functions 
in Hive. But using a combo of SQL and Spark operations you should be able to 
run the basic SQL, then do a groupBy on the SchemaRDD, then for each group just 
take the first record.
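
A rough sketch of that group-then-pick step, written in plain Python rather than against the Spark API so it runs anywhere. The sample rows and the helper name first_per_group are made up for illustration; the rows stand in for whatever the basic SQL returns as (a, b, c, time). It picks the row with the smallest time in each group instead of literally the first record, since the order of records inside a group is not something to rely on.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the result of the basic SQL query;
# each row is (a, b, c, time).
rows = [
    ("x", 1, "late", 30),
    ("x", 1, "early", 10),
    ("x", 2, "only", 20),
    ("y", 1, "second", 7),
    ("y", 1, "first", 5),
]


def first_per_group(rows):
    """Return one row per (a, b) group: the row with the smallest time."""
    groups = defaultdict(list)
    for row in rows:
        a, b, _c, _time = row
        groups[(a, b)].append(row)
    # Taking the minimum by time avoids depending on arrival order.
    return {key: min(group, key=lambda r: r[3]) for key, group in groups.items()}


result = first_per_group(rows)
for (a, b), row in sorted(result.items()):
    print(a, b, row[2])
```

On an RDD the same shape would presumably be a group by the (a, b) key followed by a per-group minimum on time, but the exact calls depend on the Spark version, so treat this only as an outline of the logic.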

From: Fengyun RAO <[email protected]>
Date: Thursday, August 21, 2014 at 8:26 AM
To: "[email protected]" <[email protected]>
Subject: Re: [Spark SQL] How to select first row in each GROUP BY group?

Could anybody help? I googled and read a lot, but didn’t find anything helpful.

Or, to make the question simpler:

How do I assign a row number within each group?

SELECT a,
       ROW_NUMBER() OVER (PARTITION BY a) AS num
FROM table

2014-08-20 15:52 GMT+08:00 Fengyun RAO <[email protected]>:

I have a table with 4 columns: a, b, c, time

What I need is something like:

SELECT a, b, GroupFirst(c)
FROM t
GROUP BY a, b

GroupFirst means "the first" item of column c within each group,
and by "the first" I mean the one with the minimal "time" in that group.

In Oracle/Sql Server, we could write:

WITH summary AS (
    SELECT a,
           b,
           c,
           ROW_NUMBER() OVER(PARTITION BY a, b ORDER BY time) AS num
    FROM t)
SELECT s.*
FROM summary s
WHERE s.num = 1

But in Spark SQL, there is no such thing as ROW_NUMBER().

I wonder how to do this.
