[jira] [Updated] (HIVE-15683) Measure performance impact on group by by HIVE-15580

Xuefu Zhang (JIRA) Tue, 07 Feb 2017 20:21:57 -0800

     [ 
https://issues.apache.org/jira/browse/HIVE-15683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Xuefu Zhang updated HIVE-15683:
-------------------------------
    Attachment: HIVE-15683.patch

The patch brings back the old implementation and provides a configuration 
option to switch on the new implementation.

> Measure performance impact on group by by HIVE-15580
> ----------------------------------------------------
>
>                 Key: HIVE-15683
>                 URL: https://issues.apache.org/jira/browse/HIVE-15683
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>    Affects Versions: 2.2.0
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-15683.patch
>
>
> HIVE-15580 changed the way the data is shuffled for group by: instead of 
> using Spark's groupByKey to shuffle data, Hive on Spark now uses 
> repartitionAndSortWithinPartitions(), which generates (key, value) pairs 
> instead of the original (key, value iterator). This might have some performance 
> implications, but it's needed to get rid of the unbounded memory usage of 
> {{groupByKey}}.
> Here we'd like to compare group by performance with and without HIVE-15580. If the 
> impact is significant, we can provide a configuration that allows users to 
> switch back to the original way of shuffling.
> This work should ideally be done after HIVE-15682, as the optimization there 
> should help the performance here as well.
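
For illustration only, here is a minimal plain-Java sketch (no Spark dependency) of the two record shapes the description contrasts: a (key, value iterator) per group, as groupByKey produces, versus a stream of sorted (key, value) pairs, as repartitionAndSortWithinPartitions() produces. The class and method names are hypothetical and are not taken from the Hive on Spark code; the sketch assumes the pairs arrive already sorted by key.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Hive on Spark code.
public class ShuffleShapeSketch {

    // groupByKey-style input: all values of a key are collected into one
    // list before the reducer sees them, so memory grows with the largest group.
    public static Map<String, Long> sumGrouped(Map<String, List<Long>> grouped) {
        Map<String, Long> sums = new LinkedHashMap<>();
        for (Map.Entry<String, List<Long>> e : grouped.entrySet()) {
            long total = 0;
            for (long v : e.getValue()) {
                total += v;
            }
            sums.put(e.getKey(), total);
        }
        return sums;
    }

    // Sorted-pairs input: the reducer detects group boundaries itself by
    // watching for a key change, keeping only a running total per group.
    public static Map<String, Long> sumSortedPairs(Iterable<Map.Entry<String, Long>> sortedPairs) {
        Map<String, Long> sums = new LinkedHashMap<>();
        String currentKey = null;
        long total = 0;
        for (Map.Entry<String, Long> pair : sortedPairs) {
            if (currentKey != null && !currentKey.equals(pair.getKey())) {
                sums.put(currentKey, total); // key changed: close the previous group
                total = 0;
            }
            currentKey = pair.getKey();
            total += pair.getValue();
        }
        if (currentKey != null) {
            sums.put(currentKey, total); // close the last group
        }
        return sums;
    }

    public static void main(String[] args) {
        Map<String, List<Long>> grouped = new LinkedHashMap<>();
        grouped.put("a", List.of(1L, 2L));
        grouped.put("b", List.of(3L));

        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("a", 1L));
        pairs.add(new SimpleEntry<>("a", 2L));
        pairs.add(new SimpleEntry<>("b", 3L));

        System.out.println(sumGrouped(grouped));
        System.out.println(sumSortedPairs(pairs));
    }
}
```

Both methods compute the same per-key sums. The second never holds a whole group in memory, at the cost of a sort and per-record key comparisons, which is the kind of overhead this issue proposes to measure.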



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
