[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

Rui Li (JIRA) Sun, 14 May 2017 20:35:00 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009958#comment-16009958
 ]


Rui Li commented on HIVE-16600:
-------------------------------

[~kellyzly], my point is that multi insert and the simple case behave 
differently with respect to the extra shuffle. We disabled parallel order by 
to avoid the extra shuffle in the simple case. In multi insert, however, it 
seems the extra shuffle can't be avoided, so there's no point in disabling it. 
The MR plan you uploaded shows that MR doesn't disable parallel order by in 
this case.
Let me summarise my understanding and suggestions.
# For an {{orderBy + Limit}} query, we have two choices: use multiple reducers 
to get a global order and then shuffle to get the global limit, or use a single 
reducer to do the order and limit together. I think the latter is better 
because we don't actually need a global order.
# In our multi insert example, it seems we cannot choose the latter. I suspect 
that's because there's only one limit. [~kellyzly], could you try adding a 
limit to both inserts and see what happens?
# Ideally, we should be able to use a single reducer if all the inserts have a 
limit (perhaps with the same or a similar number). If not, we shouldn't disable 
parallel order by for multi insert.
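
To make the rule in the last point concrete, here is a minimal sketch of the 
decision as I understand it. The class, enum and method names are hypothetical 
and are not taken from SetSparkReducerParallelism or any other Hive code. The 
sketch also ignores the "same or similar number" condition and only checks 
whether every insert branch has a limit.

```java
import java.util.List;

/** Hypothetical illustration of the proposed rule; not Hive code. */
final class OrderByReducerChoice {

    enum Strategy { SINGLE_REDUCER, PARALLEL_ORDER_BY }

    /**
     * @param insertHasLimit one flag per insert fed by the ordered subquery,
     *                       true if that insert has a LIMIT (assumed input shape)
     */
    static Strategy choose(List<Boolean> insertHasLimit) {
        // If every insert has a LIMIT, one reducer can order and limit
        // together, so a global order across reducers is not needed.
        boolean allLimited = !insertHasLimit.isEmpty()
                && insertHasLimit.stream().allMatch(Boolean::booleanValue);
        // Otherwise the extra shuffle can't be avoided, so keep parallel order by.
        return allLimited ? Strategy.SINGLE_REDUCER : Strategy.PARALLEL_ORDER_BY;
    }

    public static void main(String[] args) {
        System.out.println(choose(List.of(true)));        // simple orderBy + limit
        System.out.println(choose(List.of(true, false))); // multi insert, one limit
        System.out.println(choose(List.of(true, true)));  // multi insert, limit on both
    }
}
```

Whether the planner can actually pick the single-reducer path when both 
inserts have a limit is exactly what the experiment in the second point should 
tell us.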

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-16600
>                 URL: https://issues.apache.org/jira/browse/HIVE-16600
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>         Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
>     SELECT key, value
> INSERT OVERWRITE TABLE e2
>     SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> The parallelism of Sort is 1 even when we enable parallel order by 
> ("hive.optimize.sampling.orderby" is set to "true"). This is not reasonable 
> because the parallelism should be calculated by 
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170].
> This happens because SetSparkReducerParallelism#needSetParallelism returns 
> false when the 
> [children size of RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
> is greater than 1.
> In this case, the children size of {{RS[2]}} is two.
> The logical plan of the case:
> {code}
>    TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
>                             -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
