[ https://issues.apache.org/jira/browse/HIVE-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16046396#comment-16046396 ]
Rui Li commented on HIVE-6348: ------------------------------ Hi [~vgarg], it turns out we do need the sub query being sorted in some cases. An example is input20.q: {code} EXPLAIN FROM ( FROM src MAP src.key, src.key USING 'cat' DISTRIBUTE BY key SORT BY key, value ) tmap INSERT OVERWRITE TABLE dest1 REDUCE tmap.key, tmap.value USING 'python input20_script.py' AS key, value; {code} The query plan is: {noformat} Stage: Stage-1 Map Reduce Map Operator Tree: TableScan alias: src Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: key (type: string), key (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE Column stats: NONE Transform Operator command: cat output info: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: string), _col1 (type: string) sort order: ++ Map-reduce partition columns: _col0 (type: string) Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE Column stats: NONE value expressions: _col0 (type: string), _col1 (type: string) Reduce Operator Tree: Select Operator expressions: VALUE._col0 (type: string), VALUE._col1 (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE Column stats: NONE Transform Operator command: python input20_script.py output info: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: UDFToInteger(_col0) (type: int), _col1 (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe name: default.dest1 {noformat} The script {{input20_script.py}} counts the number of times each <Key, Value> appears and expects the input to be sorted. I'll consider how to handle such a case. > Order by/Sort by in subquery > ---------------------------- > > Key: HIVE-6348 > URL: https://issues.apache.org/jira/browse/HIVE-6348 > Project: Hive > Issue Type: Bug > Reporter: Gunther Hagleitner > Assignee: Rui Li > Priority: Minor > Labels: sub-query > Attachments: HIVE-6348.1.patch, HIVE-6348.2.patch, HIVE-6348.3.patch > > > select * from (select * from foo order by c asc) bar order by c desc; > in hive sorts the data set twice. The optimizer should probably remove any > order by/sort by in the sub query unless you use 'limit '. Could even go so > far as barring it at the semantic level. -- This message was sent by Atlassian JIRA (v6.4.14#64029)