[ https://issues.apache.org/jira/browse/HIVE-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201691#comment-14201691 ]
Rui Li commented on HIVE-8542:
------------------------------

I think the problem is that the sort keys and partition keys in the RS are not identical. The partition key is the group-by key, but the sort keys are the group-by key followed by the distinct key. Since RangePartitioner is used to partition the data, and we have quite a few reducers (31), records with the same group-by key can go to different reducers, so the final results are not properly grouped. We get correct results if #reducers is set to 1.

To reproduce, run any groupby+distinct query with a large #reducers; that is enough to reveal this issue.

> Enable groupby_map_ppr.q and groupby_map_ppr_multi_distinct.q [Spark Branch]
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-8542
>                 URL: https://issues.apache.org/jira/browse/HIVE-8542
>             Project: Hive
>          Issue Type: Test
>          Components: Spark
>            Reporter: Chao
>            Assignee: Rui Li
>
> Currently, in Spark branch, results for these two test files are very
> different from MR's. We need to find out the cause for this, and identify
> potential bug in our current implementation.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
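
For illustration only, here is a minimal Python sketch of the effect described in the comment above. It is not Hive or Spark code. It assumes the range partitioner splits on the full sort key (group-by key, distinct key); the sample rows, the function names, and the way range boundaries are picked are all hypothetical stand-ins.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical rows: (group_by_key, distinct_key).
rows = [("a", 1), ("a", 2), ("a", 3), ("a", 4), ("b", 1), ("b", 2)]


def range_partition(rows, num_reducers):
    """Assign rows to reducers by range over the FULL sort key (assumption)."""
    ordered = sorted(rows)
    # Pick num_reducers - 1 upper bounds at evenly spaced positions.
    step = len(ordered) / num_reducers
    bounds = [ordered[int(step * i) - 1] for i in range(1, num_reducers)]
    parts = defaultdict(list)
    for row in ordered:
        parts[bisect_left(bounds, row)].append(row)
    return parts


def key_partition(rows, num_reducers):
    """Assign rows to reducers by the group-by key only.

    The index of the key stands in for a hash so the output is deterministic.
    """
    group_keys = sorted({g for g, _ in rows})
    parts = defaultdict(list)
    for g, d in sorted(rows):
        parts[group_keys.index(g) % num_reducers].append((g, d))
    return parts


def count_distinct(parts):
    """Each reducer emits (group_key, count(distinct)) for the rows it sees."""
    out = []
    for reducer in sorted(parts):
        seen = defaultdict(set)
        for g, d in parts[reducer]:
            seen[g].add(d)
        out.extend(sorted((g, len(v)) for g, v in seen.items()))
    return out


# Group "a" is split across two reducers, so it is reported twice.
print(count_distinct(range_partition(rows, 3)))  # [('a', 2), ('a', 2), ('b', 2)]
# With a single reducer the grouping is correct.
print(count_distinct(range_partition(rows, 1)))  # [('a', 4), ('b', 2)]
# Partitioning on the group-by key alone keeps each group on one reducer.
print(count_distinct(key_partition(rows, 3)))    # [('a', 4), ('b', 2)]
```

In this toy setup the results are wrong only when the partitioning uses more of the key than the group-by key and there is more than one reducer, which matches the observation that setting #reducers to 1 gives correct results.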