[ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633820#comment-15633820 ]

Xuefu Zhang commented on HIVE-15104:
------------------------------------

[~lirui], thanks for sharing your findings. Can you confirm that Spark also uses BytesWritable.hashCode() to partition the RS output rows? If so, there should be no difference for Spark, because the actual object Hive passes to Spark from the RS is a HiveKey, whose hashCode() is used for partitioning. In that case, we should be able to declare the output of our map and reduce functions simply as <BytesWritable, BytesWritable>, and we would not need a custom serializer, because we would not need to declare the type as <HiveKey, BytesWritable>. It seems there is still a gap in our understanding.

> Hive on Spark generates more shuffle data than Hive on MR
> ----------------------------------------------------------
>
>                 Key: HIVE-15104
>                 URL: https://issues.apache.org/jira/browse/HIVE-15104
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 1.2.1
>            Reporter: wangwenli
>            Assignee: Aihua Xu
>
> The same SQL, running on the Spark and MR engines, will generate different
> sizes of shuffle data.
> I think this is because Hive on MR serializes only part of the HiveKey,
> while Hive on Spark, which uses Kryo, serializes the full HiveKey object.
> What is your opinion?
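For readers following this thread: the size gap under discussion comes down to how the key is serialized. Writable serialization (BytesWritable.write()) emits only a length prefix plus the valid prefix of the backing array, while Kryo's default FieldSerializer walks every non-transient field of HiveKey, including the full backing byte[] (unused capacity and all) and the cached hash-code fields. Below is a minimal sketch of what a custom Kryo serializer mirroring the Writable layout could look like; HiveKeySerializer is a hypothetical name for illustration, not the patch attached to this issue.

{code:java}
// Sketch of a Kryo serializer for HiveKey that mirrors the compact
// Writable layout: length prefix + valid bytes + hash code. Kryo's default
// FieldSerializer would instead write every non-transient field, including
// the full backing byte[] and the cached hash-code fields, which is one
// plausible source of the extra shuffle data described in this issue.
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import org.apache.hadoop.hive.ql.io.HiveKey;

public class HiveKeySerializer extends Serializer<HiveKey> {
  @Override
  public void write(Kryo kryo, Output output, HiveKey key) {
    // Write only the first getLength() bytes, as BytesWritable.write() does,
    // rather than the whole (possibly over-allocated) backing array.
    output.writeVarInt(key.getLength(), true);
    output.writeBytes(key.getBytes(), 0, key.getLength());
    // Preserve the partitioning hash; this assumes the ReduceSinkOperator
    // has already set it (HiveKey.hashCode() throws otherwise).
    output.writeInt(key.hashCode());
  }

  @Override
  public HiveKey read(Kryo kryo, Input input, Class<HiveKey> type) {
    int length = input.readVarInt(true);
    byte[] bytes = input.readBytes(length);
    // HiveKey(byte[], int) restores the bytes and marks the hash as valid.
    return new HiveKey(bytes, input.readInt());
  }
}
{code}

Such a serializer could be registered with Spark's KryoSerializer, e.g. kryo.register(HiveKey.class, new HiveKeySerializer()), so that only the bytes the partitioner and comparator actually need travel through the shuffle.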
Xuefu Zhang commented on HIVE-15104: ------------------------------------ [~lirui], thanks for sharing your findings. Can you confirm that Spark also uses BytesWritable.hashcode() to partition the RS output rows? If this is true, then there is no difference for Spark because the actual object Hive passed to Spark by RS is HiveKey, whose hashcode will be used for partitioning. If this is the case, then we should be able to define the output of our map function and reduce function just as <BytesWritable, BytesWritable>, for which we don't need a custom serializer because we don't need to declare the type as <HiveKey, BytesWritable>. It seems that there is still a gap in our understanding. > Hive on Spark generate more shuffle data than hive on mr > -------------------------------------------------------- > > Key: HIVE-15104 > URL: https://issues.apache.org/jira/browse/HIVE-15104 > Project: Hive > Issue Type: Bug > Components: Spark > Affects Versions: 1.2.1 > Reporter: wangwenli > Assignee: Aihua Xu > > the same sql, running on spark and mr engine, will generate different size > of shuffle data. > i think it is because of hive on mr just serialize part of HiveKey, but hive > on spark which using kryo will serialize full of Hivekey object. > what is your opionion? -- This message was sent by Atlassian JIRA (v6.3.4#6332)