[ https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006069#comment-16006069 ]
liyunzhang_intel commented on HIVE-16600:
-----------------------------------------

[~lirui]: I did not paste the whole MR plan here; the full plan is in the attachment mr.explain.log.HIVE-16600. In the reduce operator tree of Stage-2 we can see two Select Operator branches, which carry the multi-insert. This shows that there is no extra stage in MR mode.
{code}
Reduce Operator Tree:
  Select Operator
    expressions: KEY.reducesinkkey0 (type: string), VALUE._col1 (type: string)
    outputColumnNames: _col0, _col2
    Select Operator
      expressions: _col0 (type: string), _col2 (type: string)
      outputColumnNames: _col0, _col1
      File Output Operator
        compressed: false
        GlobalTableId: 1
        directory: hdfs://bdpe41:8020/user/hive/warehouse/e1/.hive-staging_hive_2017-05-11_15-03-16_656_7584605594727686973-1/-ext-10000
        NumFilesPerFileSink: 1
        Stats Publishing Key Prefix: hdfs://bdpe41:8020/user/hive/warehouse/e1/.hive-staging_hive_2017-05-11_15-03-16_656_7584605594727686973-1/-ext-10000/
        table:
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            properties:
              COLUMN_STATS_ACCURATE {"BASIC_STATS":"true"}
              bucket_count -1
              column.name.delimiter ,
              columns key,value
              columns.comments
              columns.types string:string
              file.inputformat org.apache.hadoop.mapred.TextInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              location hdfs://bdpe41:8020/user/hive/warehouse/e1
              name default.e1
              numFiles 0
              numRows 0
              rawDataSize 0
              serialization.ddl struct e1 { string key, string value}
              serialization.format 1
              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              totalSize 0
              transient_lastDdlTime 1494486196
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            name: default.e1
        TotalFiles: 1
        GatherStats: true
        MultiFileSpray: false
    Select Operator
      expressions: _col0 (type: string)
      outputColumnNames: _col0
      Limit
        Number of rows: 10
        File Output Operator
          compressed: false
          GlobalTableId: 0
          directory: hdfs://bdpe41:8020/tmp/hive/root/7f5afded-5f75-46f9-a588-e7317a8decca/hive_2017-05-11_15-03-16_656_7584605594727686973-1/-mr-10004
          NumFilesPerFileSink: 1
          table:
              input format: org.apache.hadoop.mapred.SequenceFileInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
              properties:
                column.name.delimiter ,
                columns _col0
                columns.types string
                escape.delim \
                serialization.lib org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
              serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
          TotalFiles: 1
          GatherStats: false
          MultiFileSpray: false
{code}

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel
> order by in multi_insert cases
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-16600
>                 URL: https://issues.apache.org/jira/browse/HIVE-16600
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>         Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch,
>                      mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code}
> The parallelism of Sort is 1 even though parallel order by is enabled
> ("hive.optimize.sampling.orderby" is set to "true").
> This is not reasonable, because the parallelism should be calculated by
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170].
> The cause is that SetSparkReducerParallelism#needSetParallelism returns false
> when the [children size of RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
> is greater than 1.
> In this case, the children size of {{RS[2]}} is two.
> The logical plan of the case:
> {code}
> TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
>                   -SEL[6]-FS[7]
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
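
For readers following the thread, here is a minimal sketch of the behaviour described in the issue: a check that returns false as soon as a reduce sink has more than one child, applied to the logical plan quoted above. It is not Hive's code. The class `ParallelismSketch`, the toy `Op` type, and the methods `needSetParallelismSketch` and `buildMultiInsertReduceSink` are hypothetical names made up for illustration, and the real SetSparkReducerParallelism#needSetParallelism may apply other conditions that are not modelled here.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of an operator tree. These are NOT Hive's real operator classes. */
final class ParallelismSketch {

    /** Hypothetical stand-in for a Hive operator such as TS, SEL, RS or FS. */
    static final class Op {
        final String name;
        final List<Op> children = new ArrayList<>();

        Op(String name) {
            this.name = name;
        }

        /** Attaches a child and returns this operator (the parent) for chaining. */
        Op add(Op child) {
            children.add(child);
            return this;
        }
    }

    /**
     * Assumption: models only the condition described in the issue, i.e. the
     * check gives up (returns false) when the reduce sink has more than one child.
     */
    static boolean needSetParallelismSketch(Op reduceSink) {
        return reduceSink.children.size() <= 1;
    }

    /**
     * Builds the logical plan from the issue and returns RS[2]:
     * TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
     *                   -SEL[6]-FS[7]
     */
    static Op buildMultiInsertReduceSink() {
        Op rs = new Op("RS[2]");
        rs.add(new Op("SEL[3]").add(new Op("SEL[4]").add(new Op("FS[5]"))));
        rs.add(new Op("SEL[6]").add(new Op("FS[7]")));
        new Op("TS[0]").add(new Op("SEL[1]").add(rs));
        return rs;
    }

    public static void main(String[] args) {
        Op rs = buildMultiInsertReduceSink();
        System.out.println(rs.name + " children: " + rs.children.size());
        System.out.println("needSetParallelism (sketch): " + needSetParallelismSketch(rs));
    }
}
```

In this toy model RS[2] has two children (SEL[3] and SEL[6]), so the sketched check returns false and the reducer count would never be estimated. That matches the symptom reported in the issue, where the Sort runs with a parallelism of 1.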