[ 
https://issues.apache.org/jira/browse/HIVE-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201691#comment-14201691
 ] 

Rui Li commented on HIVE-8542:
------------------------------

I think the problem is that the sort key and the partition key in the 
ReduceSink (RS) are not identical. The partition key is the group-by key, but 
the sort key is the group-by key followed by the distinct key. Since 
RangePartitioner is used to partition the data, and we have quite a few 
reducers (31), records with the same group-by key can go to different 
reducers, so the final results are not properly grouped. The results are 
correct if #reducers is set to 1.
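
The effect can be illustrated with a small stand-alone sketch. This is plain 
Python, not Hive or Spark code: the rows, the boundary values and the helper 
names are all hypothetical, and it assumes the range boundaries are chosen 
over the full sort key (group-by key, distinct key).

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical rows: (group-by key, distinct key).
rows = [("a", 1), ("a", 2), ("a", 3), ("a", 4), ("b", 1), ("b", 2)]


def range_partition(keys, upper_bounds):
    """Assign each key to a partition; upper_bounds are sorted and inclusive."""
    return [bisect_left(upper_bounds, k) for k in keys]


def partitions_per_group(rows, assignment):
    """Map each group-by key to the set of partitions that received it."""
    seen = defaultdict(set)
    for (group_key, _), part in zip(rows, assignment):
        seen[group_key].add(part)
    return dict(seen)


# Boundaries over the full sort key: a boundary can fall inside group "a".
full_key_bounds = [("a", 2), ("b", 1)]
by_sort_key = partitions_per_group(rows, range_partition(rows, full_key_bounds))

# Boundaries over the group-by key only: every group stays on one partition.
group_key_bounds = ["a"]
by_group_key = partitions_per_group(
    rows, range_partition([g for g, _ in rows], group_key_bounds)
)

print(by_sort_key)
print(by_group_key)
```

In the first case group "a" is spread over two partitions, so each reducer 
would aggregate only part of the group. With a single partition the split 
cannot happen, which is consistent with the results being correct when 
#reducers is 1.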

To reproduce: as long as #reducers is large, any group-by + distinct query 
should reveal this issue.

> Enable groupby_map_ppr.q and groupby_map_ppr_multi_distinct.q [Spark Branch]
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-8542
>                 URL: https://issues.apache.org/jira/browse/HIVE-8542
>             Project: Hive
>          Issue Type: Test
>          Components: Spark
>            Reporter: Chao
>            Assignee: Rui Li
>
> Currently, in the Spark branch, results for these two test files are very 
> different from MR's. We need to find the cause of this and identify any 
> potential bug in our current implementation.



