[ https://issues.apache.org/jira/browse/HIVE-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017291#comment-13017291 ]
He Yongqiang commented on HIVE-2095:
------------------------------------

Uploading a new patch to address namit's comments.

Note: there is an existing bug in Hive that causes the results of
auto_join29.q to be incorrect. Let's file another JIRA for it. Basically, if
the outer join filter is enabled, the query

    SELECT /*+mapjoin(src1, src2)*/ *
    FROM src src1
    RIGHT OUTER JOIN src src2
      ON (src1.key = src2.key AND src1.key < 10 AND src2.key > 10)
    JOIN src src3
      ON (src2.key = src3.key AND src3.key < 10)
    SORT BY src1.key, src1.value, src2.key, src2.value, src3.key, src3.value;

gives wrong results in today's Hive.

> auto convert map join bug
> -------------------------
>
>                 Key: HIVE-2095
>                 URL: https://issues.apache.org/jira/browse/HIVE-2095
>             Project: Hive
>          Issue Type: Bug
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: HIVE-2095.1.patch, HIVE-2095.2.patch
>
>
> 1)
> When considering a table as the big table candidate for a map join, if
> Hive can determine at compile time that the total known size of all other
> tables (excluding the big table under consideration) is bigger than a
> configured value, this big table candidate is a bad one and should not be
> put into the plan. Otherwise, filtering it out at runtime may take more
> time.
> 2)
> Added a null check for backup tasks. Otherwise a NullPointerException
> occurs.
> 3)
> CommonJoinResolver needs to know the full mapping of pathToAliases.
> Otherwise it will make the wrong decision.
> 4)
> Changes made to ConditionalResolverCommonJoin: added pathToAliases,
> aliasToSize (the alias's input size that is known at compile time, from
> inputSummary), and the intermediate dir path.
> So the logic is: go over all the pathToAliases and, for each path, if it
> is from the intermediate dir path, add this path's size to all aliases.
> Finally, choose the big table based on the size information and other data
> such as aliasToTask.
> 5)
> The conditional task's children contain wrong options, which may cause the
> join to fail or produce incorrect results. Basically, when getting all
> possible children for the conditional task, a whitelist of big tables
> should be used. Only tables in this whitelist can be considered as a big
> table.
> Here is the logic:
> +   * Get a list of big table candidates. Only the tables in the returned
> +   * set can be used as the big table in the join operation.
> +   *
> +   * The logic here is to scan the join condition array from left to
> +   * right. If we see an inner join and bigTableCandidates is empty, add
> +   * both sides of this inner join to the big table candidates. If we see
> +   * a left outer join and bigTableCandidates is empty, add the left side
> +   * to it; if bigTableCandidates is not empty, do nothing (which means
> +   * the bigTableCandidates come from the left side). If we see a right
> +   * outer join, clear bigTableCandidates and add the right side to it;
> +   * this means the right side of a right outer join always wins. If we
> +   * see a full outer join, return null immediately (no table can be the
> +   * big table, so a mapjoin cannot be done).
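
The candidate-selection rule quoted in item 5 can be sketched in Java as below. This is only an illustration of the rule as the comment states it, not the code from the attached patches. The class name, the JoinType enum, the JoinCond class, and the use of integer positions for table aliases are all hypothetical stand-ins. The comment does not say what happens when an inner join is seen while the candidate set is already non-empty; the sketch assumes nothing changes in that case.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration class; not part of Hive.
final class BigTableCandidates {

    // Hypothetical join kinds, mirroring the four cases named in the comment.
    enum JoinType { INNER, LEFT_OUTER, RIGHT_OUTER, FULL_OUTER }

    // Hypothetical stand-in for one entry of the join condition array.
    // left and right are the positions of the two joined table aliases.
    static final class JoinCond {
        final int left;
        final int right;
        final JoinType type;

        JoinCond(int left, int right, JoinType type) {
            this.left = left;
            this.right = right;
            this.type = type;
        }
    }

    // Scans the conditions from left to right and returns the set of aliases
    // allowed to be the big table, or null if a map join is not possible.
    static Set<Integer> getBigTableCandidates(JoinCond[] conds) {
        Set<Integer> candidates = new LinkedHashSet<>();
        for (JoinCond c : conds) {
            switch (c.type) {
                case FULL_OUTER:
                    // No table can be the big table.
                    return null;
                case RIGHT_OUTER:
                    // The right side of a right outer join always wins.
                    candidates.clear();
                    candidates.add(c.right);
                    break;
                case LEFT_OUTER:
                    // Only the left side qualifies, and only if nothing was chosen yet.
                    if (candidates.isEmpty()) {
                        candidates.add(c.left);
                    }
                    break;
                case INNER:
                    // Assumption: a non-empty set is left unchanged.
                    if (candidates.isEmpty()) {
                        candidates.add(c.left);
                        candidates.add(c.right);
                    }
                    break;
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        // Shape of the query from the comment above:
        // src1 (0) RIGHT OUTER JOIN src2 (1) JOIN src3 (2)
        JoinCond[] query = {
            new JoinCond(0, 1, JoinType.RIGHT_OUTER),
            new JoinCond(1, 2, JoinType.INNER)
        };
        System.out.println(getBigTableCandidates(query)); // prints [1]
    }
}
```

Under these assumptions, only src2 (position 1) is an acceptable big table for the query in the comment, so a conditional task child that streams src1 or src3 would not be generated.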