[ 
https://issues.apache.org/jira/browse/HIVE-27876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789045#comment-17789045
 ] 

Krisztian Kasa commented on HIVE-27876:
---------------------------------------

[~rameshkumar]
1. Based on cwiki sort by ensures the record order at bucket file level.
[https://cwiki.apache.org/confluence/display/hive/languagemanual+ddl]
Another page shares more details
[https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy#LanguageManualSortBy-DifferencebetweenSortByandOrderBy]

When inserting all records using one insert statement only one bucket file is 
created so global order matches the record order in that singe file.

2. The goal of this optimization is to remove the RS because shuffle takes most 
of the execution time.
Group by can distribute key values two ways
 * using a hash table. It has limitations (memory).
 * the data is sorted by the group by keys. This is ensured by the RS.

IIUC The optimization assumes if the physical order of the records in the table 
matches the group by needs then the RS can be removed. However it relies on the 
bucketing keys which according to 1) and also by the repro in description it 
doesn't ensures global ordering of the data.

> Incorrect query results on tables with ClusterBy & SortBy
> ---------------------------------------------------------
>
>                 Key: HIVE-27876
>                 URL: https://issues.apache.org/jira/browse/HIVE-27876
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Naresh P R
>            Assignee: Ramesh Kumar Thangarajan
>            Priority: Major
>              Labels: pull-request-available
>
> Repro:
>  
> {code:java}
> create external table test_bucket(age int, name string, dept string) 
> clustered by (age, name) sorted by (age asc, name asc) into 2 buckets stored 
> as orc;
> insert into test_bucket values (1, 'user1', 'dept1'), ( 2, 'user2' , 'dept2');
> insert into test_bucket values (1, 'user1', 'dept1'), ( 2, 'user2' , 'dept2');
> //empty wrong results
> select age, name, count(*) from test_bucket group by  age, name having 
> count(*) > 1; 
> +------+-------+------+
> | age  | name  | _c2  |
> +------+-------+------+
> +------+-------+------+
> // Workaround
> set hive.map.aggr=false;
> select age, name, count(*) from test_bucket group by  age, name having 
> count(*) > 1; 
> +------+--------+------+
> | age  |  name  | _c2  |
> +------+--------+------+
> | 1    | user1  | 2    |
> | 2    | user2  | 2    |
> +------+--------+------+ {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to