[ https://issues.apache.org/jira/browse/HIVE-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858665#comment-15858665 ]
Chao Sun commented on HIVE-15489:
---------------------------------

One issue with the current approach is that the JOIN operator we are looking at could be impacted by upstream joins/aggregations:
{code}
M1    M2
  \  /  (JOIN 1)
   R1    M3
     \  /  \
      R2    \
        \  /
         R3 (JOIN 2)
{code}
Here there are multiple reduce phases before getting to {{JOIN 2}}, which could affect the data size a lot. To minimize this inaccuracy, I propose that *we should only use TS stats if there is no RS between the JOIN and all roots reachable from it.* In the above, {{JOIN 1}} satisfies the condition while {{JOIN 2}} does not.

> Alternatively use table scan stats for HoS
> ------------------------------------------
>
>                 Key: HIVE-15489
>                 URL: https://issues.apache.org/jira/browse/HIVE-15489
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark, Statistics
>    Affects Versions: 2.2.0
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>         Attachments: HIVE-15489.1.patch, HIVE-15489.2.patch, HIVE-15489.wip.patch
>
>
> For MapJoin in HoS, we should provide an option to only use stats in the TS
> rather than the populated stats in each of the join branches. This could be
> pretty conservative but more reliable.
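
The condition proposed in the comment above (use TS stats only if no RS lies between the JOIN and the roots reachable from it) can be sketched as a small upstream traversal. This is a toy model and not Hive's operator classes: {{TsStatsCheck}}, {{Op}}, {{Kind}}, {{canUseTableScanStats}} and {{buildExample}} are hypothetical names introduced for illustration. It also assumes that an RS sitting directly above the JOIN is the join's own shuffle input and is stepped over, so only RS operators further upstream disqualify the JOIN; the actual patch may draw that line differently.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy operator graph; all names here are hypothetical, not Hive's API.
class TsStatsCheck {
    enum Kind { TABLE_SCAN, REDUCE_SINK, JOIN, OTHER }

    static final class Op {
        final String name;
        final Kind kind;
        final List<Op> parents;

        Op(String name, Kind kind, Op... parents) {
            this.name = name;
            this.kind = kind;
            this.parents = Arrays.asList(parents);
        }
    }

    // Returns true if no REDUCE_SINK lies between the join and its roots.
    // Assumption: a REDUCE_SINK that is a direct parent of the join is the
    // join's own shuffle input, so the walk starts above it.
    static boolean canUseTableScanStats(Op join) {
        Deque<Op> pending = new ArrayDeque<>();
        Set<Op> seen = new HashSet<>();
        for (Op input : join.parents) {
            if (input.kind == Kind.REDUCE_SINK) {
                pending.addAll(input.parents);
            } else {
                pending.add(input);
            }
        }
        while (!pending.isEmpty()) {
            Op op = pending.pop();
            if (!seen.add(op)) {
                continue; // already visited via another branch
            }
            if (op.kind == Kind.REDUCE_SINK) {
                return false; // an upstream reduce phase reshapes the data
            }
            pending.addAll(op.parents);
        }
        return true;
    }

    // Builds one reading of the diagram in the comment: returns {JOIN 1, JOIN 2}.
    static Op[] buildExample() {
        Op ts1 = new Op("TS in M1", Kind.TABLE_SCAN);
        Op ts2 = new Op("TS in M2", Kind.TABLE_SCAN);
        Op ts3 = new Op("TS in M3", Kind.TABLE_SCAN);
        // R1: JOIN 1 over M1 and M2.
        Op join1 = new Op("JOIN 1", Kind.JOIN,
                new Op("RS M1->R1", Kind.REDUCE_SINK, ts1),
                new Op("RS M2->R1", Kind.REDUCE_SINK, ts2));
        // R2: some upstream join/aggregation over R1 and M3.
        Op r2 = new Op("R2 op", Kind.OTHER,
                new Op("RS R1->R2", Kind.REDUCE_SINK, join1),
                new Op("RS M3->R2", Kind.REDUCE_SINK, ts3));
        // R3: JOIN 2 over R2 and M3.
        Op join2 = new Op("JOIN 2", Kind.JOIN,
                new Op("RS R2->R3", Kind.REDUCE_SINK, r2),
                new Op("RS M3->R3", Kind.REDUCE_SINK, ts3));
        return new Op[] { join1, join2 };
    }

    public static void main(String[] args) {
        for (Op join : buildExample()) {
            System.out.println(join.name + ": " + canUseTableScanStats(join));
        }
    }
}
```

Under these assumptions the sketch accepts {{JOIN 1}}, whose inputs lead straight back to table scans, and rejects {{JOIN 2}}, which sits below the R1 and R2 reduce phases. That matches the outcome described in the comment.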