[ https://issues.apache.org/jira/browse/HIVE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13833477#comment-13833477 ]
Sun Rui commented on HIVE-5891: ------------------------------- [~yhuai] Yes, "$JOIN_INTERMEDIATE" is intended to name an intermediate dataset during the processing of join, not for input dataset. Is there anything wrong with my understanding? "$JOIN_STREAM" is also another candidate maybe. GenMapRedUtils.java: {code} private static boolean needsTagging(ReduceWork rWork) { return rWork != null && (rWork.getReducer().getClass() == JoinOperator.class || rWork.getReducer().getClass() == DemuxOperator.class); } splitTasks(): if (needsTagging(cplan.getReduceWork())) { String origStreamDesc; Operator<? extends OperatorDesc> joinOp = cplan.getReduceWork().getReducer(); QBJoinTree joinTree = parseCtx.getJoinContext().get(joinOp); if (joinTree != null) { streamDesc = joinTree.getJoinStreamDesc(); } else { streamDesc = "$INTNAME"; } {code} Look at the above piece of code, an alias is also needed when the reduce work starts with a DemuxOperator. For this case, "$INTNAME" is still un-changed because I have no idea how DemuxOperator works. > Alias conflict when merging multiple mapjoin tasks into their common child > mapred task > -------------------------------------------------------------------------------------- > > Key: HIVE-5891 > URL: https://issues.apache.org/jira/browse/HIVE-5891 > Project: Hive > Issue Type: Bug > Components: Query Processor > Affects Versions: 0.12.0 > Reporter: Sun Rui > Assignee: Sun Rui > Attachments: HIVE-5891.1.patch > > > Use the following test case with HIVE 0.12: > {quote} > create table src(key int, value string); > load data local inpath 'src/data/files/kv1.txt' overwrite into table src; > select * from ( > select c.key from > (select a.key from src a join src b on a.key=b.key group by a.key) tmp > join src c on tmp.key=c.key > union all > select c.key from > (select a.key from src a join src b on a.key=b.key group by a.key) tmp > join src c on tmp.key=c.key > ) x; > {quote} > We will get a NullPointerException from Union Operator: > {quote} > java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: > Hive Runtime Error while processing row {"_col0":0} > at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:175) > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime > Error while processing row {"_col0":0} > at > org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:544) > at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:157) > ... 4 more > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.exec.UnionOperator.processOp(UnionOperator.java:120) > at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504) > at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842) > at > org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88) > at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504) > at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842) > at > org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:652) > at > org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:655) > at > org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:758) > at > org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:220) > at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504) > at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842) > at > org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:91) > at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504) > at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842) > at > org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:534) > ... 5 more > {quote} > > The root cause is in > CommonJoinTaskDispatcher.mergeMapJoinTaskIntoItsChildMapRedTask(). > +--------------+ +--------------+ > | MapJoin task | | MapJoin task | > +--------------+ +--------------+ > \ / > \ / > +--------------+ > | Union task | > +--------------+ > > CommonJoinTaskDispatcher merges the two MapJoin tasks into their common > child: Union task. The two MapJoin tasks have the same alias name for their > big tables: $INTNAME, which is the name of the temporary table of a join > stream. The aliasToWork map uses alias as key, so eventually only the MapJoin > operator tree of one MapJoin task is saved into the aliasToWork map of the > Union task, while the MapJoin operator tree of another MapJoin task is lost. > As a result, Union operator won't be initialized because not all of its > parents gets intialized (The Union operator itself indicates it has two > parents, but actually it has only 1 parent because another parent is lost). > This issue does not exist in HIVE 0.11 and thus is a regression bug in HIVE > 0.12. > The propsed solution is to use the query ID as prefix for the join stream > name to avoid conflict and add sanity check code in CommonJoinTaskDispatcher > that merge of a MapJoin task into its child MapRed task is skipped if there > is any alias conflict. Please review the patch. I am not sure if the patch > properly handles the case of DemuxOperator. > BTW, anyone knows the origin of "$INTNAME"? it is so confusing, maybe we can > replace it with a meaningful name. -- This message was sent by Atlassian JIRA (v6.1#6144)