[ 
https://issues.apache.org/jira/browse/HIVE-5358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13778947#comment-13778947
 ] 

Yin Huai commented on HIVE-5358:
--------------------------------

[~ashutoshc] I think that for those two cases with 
hive.optimize.correlation=true, the ordering of the key columns does not 
matter. Because those queries only need to group rows, either [key, value] or 
[value, key] should be fine as the key order for the RS. The reason I preserved 
the ordering in Correlation Optimizer is that ReduceSinkDeDuplication can merge 
the RS for ORDER BY with another RS (for example, one for GROUP BY). In that 
case, ordering matters. When Correlation Optimizer gets the operator tree, it 
does not know whether the key columns in an RS are used only for grouping or 
also for ordering. I think it may be better to annotate which columns are used 
for grouping and which columns are used for sorting.

[~chenchun] For your change, what will be the plan for the following query?
{code}
select c3, c2 from (select c1, c2, c3 from t1 order by c1, c2, c3) t group by 
c3, c2;
{code}
If we use [c1, c2, c3] as the key columns, rows with the same [c3, c2] are not 
grouped together on the reduce side.
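
To make the concern concrete, here is a minimal standalone sketch (plain Java, 
not Hive code; the class name, helper names and sample rows are all made up for 
illustration). It assumes the reduce side detects a group boundary whenever the 
grouping columns change between consecutive rows, and compares the number of 
such runs with the number of distinct [c3, c2] values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Hive.
class SortOrderGroupingSketch {

    // Sorts rows lexicographically by sortCols, then counts runs of
    // consecutive rows that share the same values in groupCols.
    static int countRuns(List<String[]> rows, int[] sortCols, int[] groupCols) {
        List<String[]> sorted = new ArrayList<>(rows);
        sorted.sort((a, b) -> {
            for (int c : sortCols) {
                int cmp = a[c].compareTo(b[c]);
                if (cmp != 0) {
                    return cmp;
                }
            }
            return 0;
        });
        int runs = 0;
        String prev = null;
        for (String[] row : sorted) {
            String key = project(row, groupCols);
            if (!key.equals(prev)) {
                runs++;
                prev = key;
            }
        }
        return runs;
    }

    // Counts distinct value combinations of groupCols, ignoring row order.
    static int countDistinct(List<String[]> rows, int[] groupCols) {
        Set<String> keys = new HashSet<>();
        for (String[] row : rows) {
            keys.add(project(row, groupCols));
        }
        return keys.size();
    }

    // Builds a string key from the selected columns of a row.
    static String project(String[] row, int[] cols) {
        StringBuilder sb = new StringBuilder();
        for (int c : cols) {
            sb.append(row[c]).append('|');
        }
        return sb.toString();
    }

    // Made-up rows with columns (c1, c2, c3) at indexes 0, 1, 2.
    static List<String[]> sampleRows() {
        return Arrays.asList(
            new String[] {"1", "a", "x"},
            new String[] {"2", "b", "y"},
            new String[] {"3", "a", "x"});
    }

    public static void main(String[] args) {
        List<String[]> rows = sampleRows();
        int[] groupBy = {2, 1}; // group by c3, c2
        System.out.println("runs when sorted by c1, c2, c3: "
            + countRuns(rows, new int[] {0, 1, 2}, groupBy));
        System.out.println("runs when sorted by c3, c2, c1: "
            + countRuns(rows, new int[] {2, 1, 0}, groupBy));
        System.out.println("distinct [c3, c2] groups: "
            + countDistinct(rows, groupBy));
    }
}
```

With these sample rows, sorting by [c1, c2, c3] splits the two rows that share 
[c3, c2] = [x, a] into separate runs, while sorting by [c3, c2, c1] keeps them 
adjacent.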

As I understand it, checkExprs in ReduceSinkDeDuplication currently handles 
only the cases where ckeys starts with pkeys or pkeys starts with ckeys, for 
example pkeys = [c1, c2, c3] and ckeys = [c1, c2].
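
The prefix condition described above can be sketched roughly as follows. This 
is not the actual checkExprs code: the class and method names are hypothetical, 
and key columns are modeled as plain strings rather than Hive expression 
objects. I am also assuming pkeys and ckeys mean the parent and child RS keys.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the Hive implementation.
class KeyPrefixSketch {

    // True if 'shorter' is an order-preserving prefix of 'longer'.
    static boolean startsWith(List<String> longer, List<String> shorter) {
        return shorter.size() <= longer.size()
            && longer.subList(0, shorter.size()).equals(shorter);
    }

    // True if one key list is a prefix of the other, in the same column order.
    static boolean prefixCompatible(List<String> ckeys, List<String> pkeys) {
        return startsWith(ckeys, pkeys) || startsWith(pkeys, ckeys);
    }

    public static void main(String[] args) {
        List<String> pkeys = Arrays.asList("c1", "c2", "c3");
        // ckeys is a prefix of pkeys.
        System.out.println(prefixCompatible(Arrays.asList("c1", "c2"), pkeys));
        // Same columns are involved, but not as a prefix.
        System.out.println(prefixCompatible(Arrays.asList("c3", "c2"), pkeys));
        // Same columns in a different order, as in the issue description.
        System.out.println(prefixCompatible(
            Arrays.asList("value", "key"), Arrays.asList("key", "value")));
    }
}
```

Under this reading, [value, key] against [key, value] is rejected purely 
because of column order, which is the case this issue wants to relax when the 
keys are used only for grouping.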

                
> ReduceSinkDeDuplication should ignore column orders when check overlapping 
> part of keys between parent and child
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-5358
>                 URL: https://issues.apache.org/jira/browse/HIVE-5358
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Chun Chen
>            Assignee: Chun Chen
>         Attachments: D13113.1.patch, HIVE-5358.2.patch, HIVE-5358.patch
>
>
> {code}
> select key, value from (select key, value from src group by key, value) t 
> group by key, value;
> {code}
> This can be optimized by ReduceSinkDeDuplication
> {code}
> select key, value from (select key, value from src group by key, value) t 
> group by value, key;
> {code}
> However the sql above can't be optimized by ReduceSinkDeDuplication currently 
> due to different column orders of parent and child operator.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
