Sahil Takiar created HIVE-20270:
-----------------------------------

             Summary: Don't serialize hashCode for groupByKey
                 Key: HIVE-20270
                 URL: https://issues.apache.org/jira/browse/HIVE-20270
             Project: Hive
          Issue Type: Bug
          Components: Spark
            Reporter: Sahil Takiar
            Assignee: Sahil Takiar


Similar to HIVE-20032, but for {{groupByKey}}. The tricky part with 
{{groupByKey}} is that we need to preserve the {{hashCode}} until the key has 
been partitioned (via the {{HashPartitioner}}); after that, it no longer needs 
to be preserved. The {{groupByKey}} operator in Spark does require a 
{{hashCode}}, since it puts everything in a map, but it can use a different 
{{hashCode}} than the one specified in {{HiveKey}}. The {{hashCode}} in 
{{HiveKey}} only matters for determining which partition the key is assigned 
to.
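
A rough sketch of the idea, using only the JDK and not the real Hive or Spark 
classes: {{SketchKey}}, {{partitionHash}}, {{partitionFor}} and {{roundTrip}} 
are hypothetical names invented for illustration. It assumes the partition 
hash can be held in a field that is skipped during serialization, while 
{{hashCode}} is recomputed from the key bytes for the map-based grouping. The 
actual change may look quite different.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class PartitionHashSketch {

    /** Hypothetical stand-in for HiveKey; this is NOT the real class. */
    static final class SketchKey implements Serializable {
        private static final long serialVersionUID = 1L;

        final byte[] bytes;
        // Externally assigned hash, used only to pick a partition.
        // Marked transient so it is not written during serialization.
        transient int partitionHash;

        SketchKey(byte[] bytes, int partitionHash) {
            this.bytes = bytes;
            this.partitionHash = partitionHash;
        }

        // Recomputed from the bytes, so it survives serialization
        // without being stored (at the cost of extra CPU per call).
        @Override
        public int hashCode() {
            return Arrays.hashCode(bytes);
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof SketchKey
                && Arrays.equals(bytes, ((SketchKey) o).bytes);
        }
    }

    /** Assumed partitioning rule: non-negative modulo of the partition hash. */
    static int partitionFor(SketchKey key, int numPartitions) {
        return Math.floorMod(key.partitionHash, numPartitions);
    }

    /** Simulates the shuffle by serializing and deserializing the key. */
    static SketchKey roundTrip(SketchKey key)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(key);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(buf.toByteArray()))) {
            return (SketchKey) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        SketchKey a = new SketchKey(new byte[] {1, 2, 3}, 42);
        SketchKey b = new SketchKey(new byte[] {1, 2, 3}, 42);

        // The partition is decided before the key is serialized.
        int partition = partitionFor(a, 4);

        // After the round trip the partition hash is gone, but grouping
        // in a map still works because hashCode/equals use the bytes.
        Map<SketchKey, Integer> counts = new HashMap<>();
        counts.merge(roundTrip(a), 1, Integer::sum);
        counts.merge(roundTrip(b), 1, Integer::sum);

        System.out.println("partition=" + partition
            + " groups=" + counts.size());
    }
}
```

In this sketch the two keys with equal bytes still land in the same map entry 
after the round trip, even though {{partitionHash}} has been reset to its 
default value.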

The drawback is that computing the {{hashCode}} for each {{HiveKey}} might 
require more CPU resources, so we should profile it to measure the overhead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
