Sahil Takiar created HIVE-20141: ----------------------------------- Summary: Turn hive.spark.use.groupby.shuffle off by default Key: HIVE-20141 URL: https://issues.apache.org/jira/browse/HIVE-20141 Project: Hive Issue Type: Task Components: Spark Reporter: Sahil Takiar Assignee: Sahil Takiar
[~xuefuz] any thoughts on this? I think it would provide better out of the box behavior for Hive-on-Spark users, especially for users who are migrating from Hive-on-MR to HoS. Wondering what your experience with this config has been? I've done a bunch of performance profiling with this config turned on vs. off, and for TPC-DS queries it doesn't make a significant difference. The main difference I can see is that when a Spark stage has to spill to disk, {{repartitionAndSortWithinPartitions}} spills more data to disk than {{groupByKey}} - my guess is that this happens because {{groupByKey}} stores everything in Spark's {{ExternalAppendOnlyMap}} (which only stores a single copy of the key for potentially multiple values) whereas {{repartitionAndSortWithinPartitions}} uses Spark's {{ExternalSorter}} which sorts all the K, V pairs (and thus doesn't de-duplicate keys, which results in more data being spilled to disk). My understanding is that using {{repartitionAndSortWithinPartitions}} for Hive GROUP BYs is similar to what Hive-on-MR does. So disabling this config would provide a similar experience to HoMR. Furthermore, last I checked, {{groupByKey}} still can't spill within a row group. -- This message was sent by Atlassian JIRA (v7.6.3#76005)