[ 
https://issues.apache.org/jira/browse/HIVE-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858665#comment-15858665
 ] 

Chao Sun commented on HIVE-15489:
---------------------------------

One issue with the current approach is that the JOIN operator we are looking 
at could be affected by upstream joins/aggregations:

{code}
      M1   M2
       \  /
(JOIN 1) R1     M3
         \     /
          \   R2
           \ /
            R3 (JOIN 2)
{code}
Here there are multiple reduce phases before getting to {{JOIN 2}}, each of 
which could change the data size substantially.
To minimize this inaccuracy, I propose that *we should only use TS (TableScan) 
stats if there is no RS (ReduceSink) between the JOIN and any root reachable 
from it.*
In the above, {{JOIN 1}} satisfies the condition while {{JOIN 2}} does not.
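
A rough sketch of the proposed check, for illustration only. It models the diagram at the stage level and assumes that every intermediate reduce stage upstream of the join corresponds to an RS boundary. {{Stage}}, {{TsStatsRule}} and {{canUseTableScanStats}} are hypothetical names made up for this example, not classes or methods from Hive or from the attached patches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a map or reduce stage in the diagram; not a Hive class. */
final class Stage {
    final String name;
    final List<Stage> parents = new ArrayList<>();

    Stage(String name, Stage... parents) {
        this.name = name;
        this.parents.addAll(Arrays.asList(parents));
    }

    /** A root has no upstream stage, i.e. it is a map stage fed directly by table scans. */
    boolean isRoot() {
        return parents.isEmpty();
    }
}

final class TsStatsRule {
    /**
     * Returns true when every stage feeding the join's stage is a root.
     * In that case no intermediate reduce stage sits between the join and
     * the table scans, so (under the assumption above) TS stats are usable.
     */
    static boolean canUseTableScanStats(Stage joinStage) {
        for (Stage parent : joinStage.parents) {
            if (!parent.isRoot()) {
                // A non-root parent is a reduce stage between the join and some root.
                return false;
            }
        }
        return true;
    }
}
```

Building the diagram as {{R1}} with parents {{M1}}, {{M2}} and {{R3}} with parents {{R1}}, {{R2}} (where {{R2}} has parent {{M3}}), the check accepts {{R1}} ({{JOIN 1}}) and rejects {{R3}} ({{JOIN 2}}). A real implementation would presumably walk the operator tree rather than stages.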

> Alternatively use table scan stats for HoS
> ------------------------------------------
>
>                 Key: HIVE-15489
>                 URL: https://issues.apache.org/jira/browse/HIVE-15489
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark, Statistics
>    Affects Versions: 2.2.0
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>         Attachments: HIVE-15489.1.patch, HIVE-15489.2.patch, 
> HIVE-15489.wip.patch
>
>
> For MapJoin in HoS, we should provide an option to only use stats in the TS 
> rather than the populated stats in each of the join branches. This could be 
> pretty conservative but more reliable.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
