[jira] [Commented] (HIVE-20032) Don't serialize hashCode when groupByShuffle and RDD cacheing is disabled

Sahil Takiar (JIRA) Wed, 11 Jul 2018 08:37:17 -0700


    [ https://issues.apache.org/jira/browse/HIVE-20032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540257#comment-16540257 ]


Sahil Takiar commented on HIVE-20032:
-------------------------------------

[~lirui] could you take a look?

This patch also turns {{hive.spark.optimize.shuffle.serde}} on by default. I 
think we should try to get to a point where we never have to serialize the 
hashCode. It's confusing to users migrating from Hive-on-MR to Hive-on-Spark 
(HoS) when they see a query shuffle more data in HoS than in Hive-on-MR.

This is the first step towards achieving that. Doing it completely will be 
tricky. Off the top of my head, we will need a way to specify separate 
serializers for caching RDDs vs. shuffling them. We will also need a way to 
preserve the hashCode for {{groupByKey}}.
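
As a rough illustration of why the hashCode shows up as extra shuffle data, 
here is a minimal sketch. It is not the serializer from the patch; the class 
{{ShuffleKeySketch}}, the {{serialize}} helper and the sample key are all 
hypothetical, and the layout (length, key bytes, optional 4-byte hash) is 
only an assumption made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ShuffleKeySketch {

    // Hypothetical helper: writes a 4-byte length, the key bytes and,
    // optionally, a 4-byte hash code. This is an assumed layout, not Hive's.
    static byte[] serialize(byte[] key, boolean withHashCode) {
        int size = Integer.BYTES + key.length + (withHashCode ? Integer.BYTES : 0);
        ByteBuffer buf = ByteBuffer.allocate(size);
        buf.putInt(key.length);
        buf.put(key);
        if (withHashCode) {
            // Stand-in for the key's precomputed hash code.
            buf.putInt(Arrays.hashCode(key));
        }
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] key = "user_42".getBytes(StandardCharsets.UTF_8);
        int with = serialize(key, true).length;
        int without = serialize(key, false).length;
        // The difference is a fixed 4 bytes per shuffled record.
        System.out.println("with hashCode:    " + with + " bytes");
        System.out.println("without hashCode: " + without + " bytes");
        System.out.println("overhead:         " + (with - without) + " bytes per key");
    }
}
```

Under these assumptions the overhead is a constant 4 bytes per key, which 
matters most for queries that shuffle many small keys.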

> Don't serialize hashCode when groupByShuffle and RDD cacheing is disabled
> -------------------------------------------------------------------------
>
>                 Key: HIVE-20032
>                 URL: https://issues.apache.org/jira/browse/HIVE-20032
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>         Attachments: HIVE-20032.1.patch, HIVE-20032.2.patch, 
> HIVE-20032.3.patch
>
>
> Follow-up on HIVE-15104: if we don't enable RDD caching or groupByShuffles, 
> then we don't need to serialize the hashCode when shuffling data in HoS.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
