Hi Robin,

thanks a lot for the hint about the WITH statement. This is comparable to using 
a subselect.
E.g., I could rewrite my query from:
SELECT
    actor,
    COLLECT_LIST(action) AS actions
FROM
    foobar
GROUP BY
    actor
SORT BY
    actor, ts;

to:
SELECT
    actor,
    COLLECT_LIST(action) AS actions
FROM
    (
        SELECT
            actor,
            ts,
            action
        FROM
            foobar
        ORDER BY
            actor,
            ts
    ) foobar_sorted
GROUP BY
    actor;

This has several disadvantages:
a) we have two MR jobs instead of one
b) the global ORDER BY is forcing data to flow through a single reducer, which 
increases latency
c) when reading the sorted results from the first job, data for the same reduce 
group can come from two different input splits. Correct sorting is not 
guaranteed for these reduce groups.

I created a JIRA for a new UDAF COLLECT_LIST_SORTED:
https://issues.apache.org/jira/browse/HIVE-11022

Best regards
Michael



> On 2015-06-13, at 16:52, Robin Verlangen <ro...@us2.nl> wrote:
> 
> Hi Michael,
> 
> You can try using a with statement, pseudo:
> 
> WITH input AS (SELECT colA, colB FROM table ORDER BY colA ASC)
> SELECT colB FROM input
> 
> Best regards, 
> 
> Robin Verlangen
> Chief Data Architect
> 
> W http://www.robinverlangen.nl
> E ro...@us2.nl
> 
> 
> What is CloudPelican?
> 
> Disclaimer: The information contained in this message and attachments is 
> intended solely for the attention and use of the named addressee and may be 
> confidential. If you are not the intended recipient, you are reminded that 
> the information remains the property of the sender. You must not use, 
> disclose, distribute, copy, print or rely on this e-mail. If you have 
> received this message in error, please contact the sender immediately and 
> irrevocably delete this message and any copies.
> 
> On Sat, Jun 13, 2015 at 3:50 PM, Michael Häusler <mich...@akatose.de> wrote:
> Hi there,
> 
> imagine you have a table of time-series transactions, e.g.:
> CREATE TABLE
>     foobar (actor BIGINT, ts BIGINT, action STRING);
> 
> containing the rows:
> 1       2000    bar
> 1       1000    foo
> 2       1500    foo
> 
> An interesting query would be to get a denormalized view on all actions of an 
> actor sorted by timestamp:
> 1       [foo, bar]
> 2       [foo]
> 
> This would require collecting the actions in a list in non-natural order. A 
> naive query would look like this:
> SET hive.map.aggr = false;
> SELECT
>     actor,
>     COLLECT_LIST(action) AS actions
> FROM
>     foobar
> GROUP BY
>     actor
> SORT BY
>     actor, ts;
> 
> Unfortunately, as the column "ts" is not in the final result set, this 
> produces a SemanticException [Error 10004]: Line 9:11 Invalid table alias or 
> column reference 'ts'.
> 
> 
> I do understand that it is difficult to allow a global ORDER BY on column 
> that is not in the result set.
> But the SORT BY only needs to ensure order within the reduce group. Is there 
> a way to get this behaviour in Hive?
> 
> Best regards
> Michael
> 

Reply via email to