[ https://issues.apache.org/jira/browse/HIVE-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14548501#comment-14548501 ]
Xuefu Zhang commented on HIVE-10458:
------------------------------------

1. I think we should let hive.optimize.sampling.orderby control parallel order by for Spark.

2. As to the implementation, we have two choices:

a1) Use Spark's sortByKey transformation, as your patch #3 does. In this approach, Spark does the sampling and the key partitioning.

a2) Use Hive's approach: Hive does the sampling and sets up a partitioner, and we pass that partitioner to Spark's repartitionAndSortWithinPartitions transformation. (Currently this transformation is used in a different context, with a hash partitioner.)

Both approaches are acceptable to me. Approach a1 seems simpler, with less code to write, but is more tied to Spark. I'm not sure about the performance difference; measuring it would be great, but it isn't critical at this moment. If we take approach a1, we need to make sure we are not sampling twice, i.e. that MR's sampler and total order partitioner are turned off for Spark. Rough sketches of both approaches follow the quoted issue below.

> Enable parallel order by for spark [Spark Branch]
> -------------------------------------------------
>
>                 Key: HIVE-10458
>                 URL: https://issues.apache.org/jira/browse/HIVE-10458
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Rui Li
>            Assignee: Rui Li
>         Attachments: HIVE-10458.1-spark.patch, HIVE-10458.2-spark.patch, HIVE-10458.3-spark.patch
>
>
> We don't have to force reducer# to 1 as spark supports parallel sorting.
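To make approach a1 concrete, here is a minimal, self-contained sketch using Spark's Java API. This is illustrative only, not Hive code: the class name, master setting, and toy data are made up. sortByKey samples the keys internally to build a range partitioner, then emits partitions that are globally ordered:

{code:java}
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

// Approach a1 sketch: Spark's sortByKey does both the sampling and the
// range partitioning. Class name and data are illustrative only.
public class SortByKeySketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("a1-sortByKey").setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaPairRDD<Integer, String> rows = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>(3, "c"), new Tuple2<>(1, "a"), new Tuple2<>(5, "e"),
        new Tuple2<>(2, "b"), new Tuple2<>(4, "d")));

    // sortByKey samples the keys to build a RangePartitioner, then returns
    // 2 partitions whose contents are ordered across partition boundaries.
    JavaPairRDD<Integer, String> sorted = rows.sortByKey(true, 2);

    System.out.println(sorted.collect());
    sc.stop();
  }
}
{code}

Since sortByKey does its own sampling, this path must keep Hive's sampler and MR's total order partitioner disabled, as noted above, to avoid sampling the data twice.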
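And a sketch of approach a2, again illustrative only: the RangeBoundaryPartitioner below is a hard-coded stand-in for the partitioner Hive would build from its own sampling, so Spark only shuffles by that partitioner and sorts within each partition:

{code:java}
import java.util.Arrays;

import org.apache.spark.Partitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

// Approach a2 sketch: "Hive" supplies the partitioner; Spark only shuffles
// and sorts within each partition. The boundaries below are hard-coded
// stand-ins for the split points Hive's sampler would compute.
public class RepartitionSortSketch {

  // Toy range partitioner standing in for Hive's total order partitioner.
  static class RangeBoundaryPartitioner extends Partitioner {
    private final int[] boundaries;
    RangeBoundaryPartitioner(int[] boundaries) { this.boundaries = boundaries; }
    @Override public int numPartitions() { return boundaries.length + 1; }
    @Override public int getPartition(Object key) {
      int k = (Integer) key;
      int p = 0;
      // Keys below boundaries[0] go to partition 0, and so on upward.
      while (p < boundaries.length && k >= boundaries[p]) p++;
      return p;
    }
  }

  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("a2-repartitionAndSort").setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaPairRDD<Integer, String> rows = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>(30, "c"), new Tuple2<>(1, "a"), new Tuple2<>(15, "b"),
        new Tuple2<>(42, "e"), new Tuple2<>(27, "d")));

    // Keys < 10 land in partition 0, [10, 25) in 1, >= 25 in 2; sorting
    // within each partition then yields a total order over all partitions.
    JavaPairRDD<Integer, String> sorted =
        rows.repartitionAndSortWithinPartitions(new RangeBoundaryPartitioner(new int[]{10, 25}));

    System.out.println(sorted.collect());
    sc.stop();
  }
}
{code}

The trade-off is that a2 keeps the sampling and split-point logic inside Hive, consistent with the MR path, at the cost of writing and wiring up the partitioner ourselves.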