Sahil Takiar created HIVE-20141:
-----------------------------------

             Summary: Turn hive.spark.use.groupby.shuffle off by default
                 Key: HIVE-20141
                 URL: https://issues.apache.org/jira/browse/HIVE-20141
             Project: Hive
          Issue Type: Task
          Components: Spark
            Reporter: Sahil Takiar
            Assignee: Sahil Takiar


[~xuefuz] any thoughts on this? I think it would provide better out-of-the-box 
behavior for Hive-on-Spark users, especially those migrating from 
Hive-on-MR to HoS. What has your experience with this config been?

I've profiled performance with this config turned on vs. off, and for TPC-DS 
queries it doesn't make a significant difference. The main difference I can 
see is that when a Spark stage has to spill to disk, 
{{repartitionAndSortWithinPartitions}} spills more data than {{groupByKey}}. 
My guess is that this happens because {{groupByKey}} stores everything in 
Spark's {{ExternalAppendOnlyMap}} (which stores only a single copy of each key 
for potentially multiple values), whereas 
{{repartitionAndSortWithinPartitions}} uses Spark's {{ExternalSorter}}, which 
sorts all the (K, V) pairs and thus doesn't de-duplicate keys, so more data 
ends up being spilled to disk.
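
To make the guess above concrete, here is a toy model in plain Python. It is not Spark code and does not use {{ExternalAppendOnlyMap}} or {{ExternalSorter}}; the function names are hypothetical and it only counts how many key and value slots each layout would hold for the same input, under the assumption that the grouped layout keeps one copy of each key while the sorted layout keeps every (K, V) record.

```python
from collections import defaultdict


def grouped_footprint(pairs):
    """Count (key slots, value slots) when values are grouped under one copy of each key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    key_slots = len(groups)
    value_slots = sum(len(values) for values in groups.values())
    return key_slots, value_slots


def sorted_footprint(pairs):
    """Count (key slots, value slots) when every (key, value) record is kept and sorted."""
    records = sorted(pairs)
    # Each record carries its own copy of the key, so keys are not de-duplicated.
    return len(records), len(records)


pairs = [("a", 1), ("b", 2), ("a", 3), ("a", 4), ("b", 5)]
print(grouped_footprint(pairs))  # (2, 5)
print(sorted_footprint(pairs))   # (5, 5)
```

With wide keys and many values per key, the difference between 2 and 5 key slots in this toy input is the kind of gap that could plausibly show up as extra spilled bytes.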

My understanding is that using {{repartitionAndSortWithinPartitions}} for Hive 
GROUP BYs is similar to what Hive-on-MR does, so disabling this config would 
provide a similar experience to HoMR. Furthermore, last I checked, 
{{groupByKey}} still can't spill within a row group, i.e. all the values for a 
single key have to fit in memory.
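
As a rough illustration of why sorted input resembles the Hive-on-MR model, the sketch below (plain Python, not Hive or Spark code; the function name is hypothetical) aggregates records that arrive already sorted by key. Because all records for a key are adjacent, the aggregation only needs to keep a running total for the current group rather than materializing every value for that key.

```python
from itertools import groupby
from operator import itemgetter


def sum_by_key_sorted(sorted_pairs):
    """Aggregate consecutive records that share a key, one group at a time."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        total = 0
        for _, value in group:
            # Only the running total is held; values are consumed as they stream by.
            total += value
        yield key, total


pairs = [("b", 2), ("a", 1), ("a", 3), ("b", 5), ("a", 4)]
result = list(sum_by_key_sorted(sorted(pairs)))
print(result)  # [('a', 8), ('b', 7)]
```

The in-memory {{sorted(pairs)}} call here stands in for the shuffle-side sort; in a real engine that sort is the part that can spill to disk.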



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)