[ https://issues.apache.org/jira/browse/HIVE-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xuefu Zhang updated HIVE-15683:
-------------------------------
    Attachment: HIVE-15683.patch

The patch brings back the old implementation and provides a configuration to switch on the new implementation.

> Measure performance impact on group by by HIVE-15580
> ----------------------------------------------------
>
>                 Key: HIVE-15683
>                 URL: https://issues.apache.org/jira/browse/HIVE-15683
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>    Affects Versions: 2.2.0
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-15683.patch
>
>
> HIVE-15580 changed the way the data is shuffled for order by: instead of
> using Spark's groupByKey to shuffle data, Hive on Spark now uses
> repartitionAndSortWithinPartitions(), which generates (key, value) pairs
> instead of the original (key, value iterator). This might have some
> performance implications, but it's needed to get rid of the unbounded
> memory usage by {{groupByKey}}.
> Here we'd like to compare group by performance with and without HIVE-15580.
> If the impact is significant, we can provide a configuration that allows
> users to switch back to the original way of shuffling.
> This work should ideally be done after HIVE-15682, as the optimization there
> should help the performance here as well.
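
To make the two data shapes in the description concrete, here is a minimal sketch in plain Python. It does not use Spark or any Hive code: the helpers shuffle_grouped and shuffle_sorted_pairs are hypothetical stand-ins that only mimic the output shape of groupByKey and repartitionAndSortWithinPartitions() for a single partition, and all other names are assumptions made for illustration.

```python
from operator import itemgetter


def shuffle_grouped(pairs):
    """Hypothetical stand-in for a groupByKey-style shuffle.

    Every value of a key is collected into one list before the consumer
    sees it, so memory use grows with the largest group.
    """
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return sorted(groups.items())


def shuffle_sorted_pairs(pairs):
    """Hypothetical stand-in for a sorted (key, value) shuffle.

    Returns a flat stream of pairs ordered by key. The in-memory sorted()
    call is only for this illustration; it is not how Spark sorts.
    """
    return iter(sorted(pairs, key=itemgetter(0)))


def sum_from_groups(groups):
    """Aggregate when each key arrives with all of its values."""
    return [(key, sum(values)) for key, values in groups]


def sum_from_sorted_pairs(sorted_pairs):
    """Aggregate a sorted stream by watching for key boundaries.

    Only the current key and its running total are held at any time.
    """
    results = []
    current_key, total, seen_any = None, 0, False
    for key, value in sorted_pairs:
        if seen_any and key != current_key:
            # Key changed: the previous group is complete, so emit it.
            results.append((current_key, total))
            total = 0
        current_key, seen_any = key, True
        total += value
    if seen_any:
        results.append((current_key, total))
    return results


rows = [("b", 2), ("a", 1), ("b", 5), ("a", 4), ("c", 7)]
grouped_result = sum_from_groups(shuffle_grouped(rows))
streamed_result = sum_from_sorted_pairs(shuffle_sorted_pairs(rows))
```

Both paths produce the same per-key sums. The difference is in what the consumer must do: with grouped input it receives each key once, while with sorted pairs it has to detect group boundaries itself, which is the extra per-row work whose cost this issue proposes to measure.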