Hi Robin, thanks a lot for the hint about the WITH statement. This is comparable to using a subselect. E.g., I could rewrite my query from: SELECT actor, COLLECT_LIST(action) AS actions FROM foobar GROUP BY actor SORT BY actor, ts;
to: SELECT actor, COLLECT_LIST(action) AS actions FROM ( SELECT actor, ts, action FROM foobar ORDER BY actor, ts ) foobar_sorted GROUP BY actor; This has several disadvantages: a) we have two MR jobs instead of one b) the global ORDER BY is forcing data to flow through a single reducer, which increases latency c) when reading the sorted results from the first job, data for the same reduce group can come from two different input splits. Correct sorting is not guaranteed for these reduce groups. I created a JIRA for a new UDAF COLLECT_LIST_SORTED: https://issues.apache.org/jira/browse/HIVE-11022 Best regards Michael > On 2015-06-13, at 16:52, Robin Verlangen <ro...@us2.nl> wrote: > > Hi Michael, > > You can try using a with statement, pseudo: > > WITH input AS (SELECT colA, colB FROM table ORDER BY colA ASC) > SELECT colB FROM input > > Best regards, > > Robin Verlangen > Chief Data Architect > > W http://www.robinverlangen.nl > E ro...@us2.nl > > > What is CloudPelican? > > Disclaimer: The information contained in this message and attachments is > intended solely for the attention and use of the named addressee and may be > confidential. If you are not the intended recipient, you are reminded that > the information remains the property of the sender. You must not use, > disclose, distribute, copy, print or rely on this e-mail. If you have > received this message in error, please contact the sender immediately and > irrevocably delete this message and any copies. > > On Sat, Jun 13, 2015 at 3:50 PM, Michael Häusler <mich...@akatose.de> wrote: > Hi there, > > imagine you have a table of time-series transactions, e.g.: > CREATE TABLE > foobar (actor BIGINT, ts BIGINT, action STRING); > > containing the rows: > 1 2000 bar > 1 1000 foo > 2 1500 foo > > An interesting query would be to get a denormalized view on all actions of an > actor sorted by timestamp: > 1 [foo, bar] > 2 [foo] > > This would require collecting the actions in a list in non-natural order. A > naive query would look like this: > SET hive.map.aggr = false; > SELECT > actor, > COLLECT_LIST(action) AS actions > FROM > foobar > GROUP BY > actor > SORT BY > actor, ts; > > Unfortunately, as the column "ts" is not in the final result set, this > produces a SemanticException [Error 10004]: Line 9:11 Invalid table alias or > column reference 'ts'. > > > I do understand that it is difficult to allow a global ORDER BY on column > that is not in the result set. > But the SORT BY only needs to ensure order within the reduce group. Is there > a way to get this behaviour in Hive? > > Best regards > Michael >