[ https://issues.apache.org/jira/browse/HIVE-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14548501#comment-14548501 ]

Xuefu Zhang commented on HIVE-10458:
------------------------------------

1. I think we should let hive.optimize.sampling.orderby control parallel 
order by for Spark.
2. As to implementation, we have two choices:
   a1) Use Spark's sortByKey transformation, as your patch #3 does. In this 
approach, Spark does the sampling and the key-range partitioning. 
   a2) Use Hive's approach: Hive does the sampling and sets up a partitioner, 
and we pass that partitioner to Spark's repartitionAndSortWithinPartitions 
transformation. (Currently this transformation is used in a different 
context, with a hash partitioner.) Both options are sketched below.
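
To make the two options concrete, here is a minimal Java sketch against the 
Spark RDD API. The HiveKey/BytesWritable pair type and the class and method 
names are my own assumptions for illustration, not the code in the attached 
patches:

{code:java}
import org.apache.hadoop.hive.ql.io.HiveKey;
import org.apache.hadoop.io.BytesWritable;
import org.apache.spark.Partitioner;
import org.apache.spark.api.java.JavaPairRDD;

public class ParallelOrderBySketch {

  // Approach a1: let Spark sample the keys and build the range partitioner
  // itself inside sortByKey. With numPartitions > 1 the output is totally
  // ordered across partitions, i.e. a parallel order by.
  static JavaPairRDD<HiveKey, BytesWritable> sortWithSpark(
      JavaPairRDD<HiveKey, BytesWritable> input, int numPartitions) {
    return input.sortByKey(true, numPartitions);
  }

  // Approach a2: Hive does the sampling and supplies its own partitioner;
  // Spark only repartitions by that partitioner and sorts each partition.
  // Because the partitioner already assigns disjoint key ranges to
  // partitions, the per-partition sort yields a total order.
  static JavaPairRDD<HiveKey, BytesWritable> sortWithHivePartitioner(
      JavaPairRDD<HiveKey, BytesWritable> input, Partitioner hivePartitioner) {
    return input.repartitionAndSortWithinPartitions(hivePartitioner);
  }
}
{code}

In line with point 1, whichever route we take would only kick in when 
hive.optimize.sampling.orderby is enabled; otherwise we keep the single 
reducer as before.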

Both approaches are acceptable to me. Approach a1 seems simpler, with less 
code to write, but it is more tied to Spark. I'm not sure about the 
performance difference; it would be great to measure it, but that's not 
critical at this point.

If we take approach a1, we need to make sure we are not sampling twice. That 
is, MR's sampler and total order partitioner must be turned off when the 
execution engine is Spark. 
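
For that, a guard along these lines in the planning path should be enough. 
This is only a sketch: the class and method names are placeholders, not 
actual Hive compiler code; hive.execution.engine with default "mr" is the 
existing engine switch.

{code:java}
import org.apache.hadoop.conf.Configuration;

public class OrderBySamplingGuard {

  // Hypothetical guard, not the actual compiler code: only use the MR-side
  // InputSampler / TotalOrderPartitioner path when compiling for MapReduce.
  // On Spark, sampling happens either inside sortByKey (a1) or in the
  // Hive-built partitioner fed to repartitionAndSortWithinPartitions (a2),
  // so the MR path must stay off to avoid double sampling.
  static boolean useMapReduceTotalOrderSampling(Configuration conf) {
    boolean parallelOrderBy =
        conf.getBoolean("hive.optimize.sampling.orderby", false);
    boolean onSpark =
        "spark".equalsIgnoreCase(conf.get("hive.execution.engine", "mr"));
    return parallelOrderBy && !onSpark;
  }
}
{code}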

> Enable parallel order by for spark [Spark Branch]
> -------------------------------------------------
>
>                 Key: HIVE-10458
>                 URL: https://issues.apache.org/jira/browse/HIVE-10458
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Rui Li
>            Assignee: Rui Li
>         Attachments: HIVE-10458.1-spark.patch, HIVE-10458.2-spark.patch, 
> HIVE-10458.3-spark.patch
>
>
> We don't have to force the number of reducers to 1, since Spark supports 
> parallel sorting.


