[ https://issues.apache.org/jira/browse/HIVE-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated HIVE-15683:
-------------------------------
       Resolution: Fixed
    Fix Version/s: 2.2.0
     Release Note: Document the new configuration for 2.2.0.
           Status: Resolved  (was: Patch Available)

Committed to master. Thanks for the review, Chao!

> Make what's done in HIVE-15580 for group by configurable
> --------------------------------------------------------
>
>                 Key: HIVE-15683
>                 URL: https://issues.apache.org/jira/browse/HIVE-15683
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>    Affects Versions: 2.2.0
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>              Labels: TODOC2.2
>             Fix For: 2.2.0
>
>         Attachments: HIVE-15683.1.patch, HIVE-15683.2.patch, HIVE-15683.patch
>
>
> HIVE-15580 changed the way data is shuffled for group by: instead of
> using Spark's groupByKey to shuffle data, Hive on Spark now uses
> repartitionAndSortWithinPartitions(), which generates (key, value) pairs
> instead of the original (key, value iterator) pairs. This might have some
> performance implications, but it's needed to get rid of the unbounded memory
> usage of {{groupByKey}}.
> Here we'd like to compare group by performance with and without HIVE-15580.
> If the impact is significant, we can provide a configuration that allows
> users to switch back to the original way of shuffling.
> This work should ideally be done after HIVE-15682, as the optimization there
> should help the performance here as well.
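
The difference between the two shuffle output shapes described in the issue can be illustrated with a minimal sketch. This is plain Python, not Spark or Hive code: the function names (group_by_key, sorted_pairs, sum_per_key) and the sample records are hypothetical, and the sketch models only the shape of the data each approach hands to the reducer, not partitioning or the shuffle itself.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample input: (key, value) records in arrival order.
records = [("b", 2), ("a", 1), ("b", 3), ("a", 4)]


def group_by_key(pairs):
    """groupByKey-like shape: one (key, values) entry per key.

    Every value for a key is collected into a single list, so memory use
    grows with the number of values per key.
    """
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return sorted(grouped.items())


def sorted_pairs(pairs):
    """Sorted-shuffle-like shape: flat (key, value) pairs ordered by key."""
    return sorted(pairs, key=itemgetter(0))


def sum_per_key(sorted_stream):
    """Aggregate a key-sorted stream one run of equal keys at a time.

    Only a running sum is kept per key; the values are never materialized
    as a collection.
    """
    for key, run in groupby(sorted_stream, key=itemgetter(0)):
        yield key, sum(value for _, value in run)
```

With the flat, key-sorted shape the consumer has to detect key boundaries itself, as sum_per_key does here, which is one plausible source of the performance difference the issue proposes to measure.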