[ https://issues.apache.org/jira/browse/HIVE-20032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540257#comment-16540257 ]
Sahil Takiar commented on HIVE-20032:
-------------------------------------

[~lirui] could you take a look? This patch also turns {{hive.spark.optimize.shuffle.serde}} on by default.

I think we should try to get to a point where we never have to serialize the hashCode. It's confusing for users migrating from Hive-on-MR to HoS when they see a query that requires more shuffle data in HoS than in Hive-on-MR.

This is the first step towards achieving that. Doing it completely will be tricky. Off the top of my head, we will need a way to specify separate serializers for caching RDDs vs. shuffling them. We will also need a way to preserve the hashCode for {{groupByKey}}.

> Don't serialize hashCode when groupByShuffle and RDD cacheing is disabled
> -------------------------------------------------------------------------
>
>                 Key: HIVE-20032
>                 URL: https://issues.apache.org/jira/browse/HIVE-20032
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>         Attachments: HIVE-20032.1.patch, HIVE-20032.2.patch,
> HIVE-20032.3.patch
>
>
> Follow up on HIVE-15104, if we don't enable RDD cacheing or groupByShuffles,
> then we don't need to serialize the hashCode when shuffling data in HoS.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)