[ 
https://issues.apache.org/jira/browse/HIVE-7334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078830#comment-14078830
 ] 

Xuefu Zhang commented on HIVE-7334:
-----------------------------------

[~lirui] Thanks for the patch. I took a brief look and found that you might 
need to rebase your patch on the latest branch. At a high level, here is the 
plan for sortBy, groupBy, and HiveReduceFunction. Also, please note that there 
is some overlap between your work and [~robustchao]'s HIVE-7526. I'd like to 
make the division clear so that we don't step on each other's toes.

1. We will use groupBy unless sorting is required. For this, we need to change 
the HiveReduceFunction API. (Chao)
2. Since sortBy and groupBy generate data sets of different types, we will need 
to cluster the rows from sortBy to match the input of HiveReduceFunction. We 
will create a subclass of SparkTran for row clustering. The clustering should 
be simpler than the existing one in HiveReduceFunction, as we can assume that 
the keys are ordered. Thus, we only accumulate consecutive rows with the same 
key. (Chao)
3. We have ShuffleTran for shuffling. Currently it only uses partitionByKey(). 
We will change it to groupBy. (Chao)
4. We will add logic in SparkCompiler/SparkPlanGenerator to determine which 
shuffle to use: either groupBy + ReduceTran or sortBy + RowClusteringTran 
+ ReduceTran. (Rui)
5. Make sure Hive's order by, sort by, distribute by, and cluster by work. 
(Rui)
6. It seems that we don't need partitionByKey.
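
To illustrate the row clustering in item 2, here is a minimal sketch in plain 
Java, not the actual patch. It assumes the sorted rows arrive as an iterator of 
key/value pairs ordered by key, and it emits one (key, list of values) pair per 
run of equal keys. The class name RowClusteringSketch, the method cluster(), 
and the use of java.util.Map.Entry for rows are all hypothetical; the real 
SparkTran subclass and the input type of HiveReduceFunction may look different.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Objects;

// Hypothetical illustration only; not part of Hive or Spark.
final class RowClusteringSketch {

    // Groups consecutive rows that share a key. Because the input is assumed
    // to be ordered by key, only one pending row has to be remembered; no
    // hash map over all keys is needed.
    static <K, V> Iterator<Map.Entry<K, List<V>>> cluster(
            final Iterator<Map.Entry<K, V>> sorted) {
        return new Iterator<Map.Entry<K, List<V>>>() {
            // First row of the next group, or null when the input is exhausted.
            private Map.Entry<K, V> pending = sorted.hasNext() ? sorted.next() : null;

            @Override
            public boolean hasNext() {
                return pending != null;
            }

            @Override
            public Map.Entry<K, List<V>> next() {
                if (pending == null) {
                    throw new NoSuchElementException();
                }
                K key = pending.getKey();
                List<V> values = new ArrayList<>();
                // Accumulate rows until the key changes or the input ends.
                while (pending != null && Objects.equals(pending.getKey(), key)) {
                    values.add(pending.getValue());
                    pending = sorted.hasNext() ? sorted.next() : null;
                }
                return new SimpleEntry<>(key, values);
            }
        };
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> rows = new ArrayList<>();
        rows.add(new SimpleEntry<>("a", 1));
        rows.add(new SimpleEntry<>("a", 2));
        rows.add(new SimpleEntry<>("b", 3));
        rows.add(new SimpleEntry<>("c", 4));
        rows.add(new SimpleEntry<>("c", 5));

        Iterator<Map.Entry<String, List<Integer>>> groups = cluster(rows.iterator());
        while (groups.hasNext()) {
            System.out.println(groups.next());
        }
    }
}
```

The sketch materializes each group as a list for simplicity. If the input were 
not ordered by key, the same key could show up in more than one group, which is 
why this approach only fits the sortBy path.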

Please work together with Chao to move this forward.

In addition, I'd like you to find out what it takes to support the shuffling 
required for Hive's reduce-side join. If anything is missing in Spark, please 
create the corresponding JIRAs.

Let me know if you have any questions.

> Create SparkShuffler, shuffling data between map-side data processing and 
> reduce-side processing
> ------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-7334
>                 URL: https://issues.apache.org/jira/browse/HIVE-7334
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Xuefu Zhang
>            Assignee: Rui Li
>         Attachments: HIVE-7334.patch
>
>
> Please refer to the design spec.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
