[ 
https://issues.apache.org/jira/browse/HIVE-21690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834259#comment-16834259
 ] 

Vineet Garg commented on HIVE-21690:
------------------------------------

bq. Why should we consider only aggregate operators? What about other operators?
One reason for considering the aggregate operator in addition to the join 
operator is that, like a join, it involves shuffling data in most cases and 
therefore adds significant cost to the overall execution, unlike other 
operators.

bq. Since cost model is pluggable, have you thought about creating a cost model 
that extends the join reordering (default) one with cost calculation for the 
Aggregate operator? You could use the new cost model when you trigger this 
rule. In a follow-up, you can study whether using the same cost model for join 
reordering makes sense or not, and evaluate the merit of that change for join 
reordering on its own.
I really like this suggestion and I think it is the better approach. As you 
suggested, it is prudent to evaluate the cost model change before making it 
permanent.
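
To make the suggestion concrete, here is a minimal sketch of a cost model that 
extends a join-only model with an aggregate cost, plus the plan comparison the 
rule would make. All names (PlanCostModel, JoinOnlyCostModel, 
AggregateAwareCostModel, TransposeDecision) are hypothetical stand-ins and not 
Hive or Calcite classes, and the row-count-based cost formulas are 
assumptions chosen only for illustration.

```java
// Hypothetical, simplified stand-ins; these are NOT Hive or Calcite classes.
interface PlanCostModel {
    double joinCost(double leftRows, double rightRows);
    double aggregateCost(double inputRows);
}

// Mimics a join-only model: every non-join operator is treated as free.
class JoinOnlyCostModel implements PlanCostModel {
    @Override
    public double joinCost(double leftRows, double rightRows) {
        // Assumed cost shape: proportional to the rows fed into the join.
        return leftRows + rightRows;
    }

    @Override
    public double aggregateCost(double inputRows) {
        return 0.0;
    }
}

// Extends the join-only model with a non-zero aggregate cost.
class AggregateAwareCostModel extends JoinOnlyCostModel {
    @Override
    public double aggregateCost(double inputRows) {
        // Assumed: the aggregate shuffles its input, so cost grows with input rows.
        return inputRows;
    }
}

final class TransposeDecision {
    private TransposeDecision() {}

    // Original plan: Aggregate(Join(L, R)).
    // New plan:      Aggregate(Join(Aggregate(L), R)).
    // Returns true when the new plan is strictly cheaper under the given model.
    static boolean shouldTranspose(PlanCostModel model,
                                   double leftRows, double rightRows,
                                   double joinOutRows,
                                   double leftRowsAfterAgg,
                                   double newJoinOutRows) {
        double originalCost = model.joinCost(leftRows, rightRows)
                + model.aggregateCost(joinOutRows);
        double newCost = model.aggregateCost(leftRows)
                + model.joinCost(leftRowsAfterAgg, rightRows)
                + model.aggregateCost(newJoinOutRows);
        return newCost < originalCost;
    }
}
```

With these assumed formulas, a pushed-down aggregate that only reduces 1000 
left rows to 900 is still accepted by the join-only model (1000 vs 1100), 
while the aggregate-aware model rejects it (2900 vs 2100). The same 
aggregate-aware model accepts the transpose when the aggregate reduces 1000 
rows to 10 (1120 vs 2100).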

> Support outer joins with HiveAggregateJoinTransposeRule and turn it on by 
> default
> ---------------------------------------------------------------------------------
>
>                 Key: HIVE-21690
>                 URL: https://issues.apache.org/jira/browse/HIVE-21690
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Planning
>            Reporter: Vineet Garg
>            Assignee: Vineet Garg
>            Priority: Major
>         Attachments: HIVE-21690.1.patch
>
>
> 1) This optimization is off by default. We would like to turn on this 
> optimization, in which the group by is pushed down below the join. In some 
> cases the top aggregate is removed, but in most cases this optimization adds 
> extra aggregate nodes. To measure whether those extra aggregates are 
> beneficial (they might add overhead without reducing rows), cost is computed 
> and compared between the previous plan and the new plan.
> Since Hive's cost model only considers the JOIN's cost and discards the cost 
> of the rest of the nodes, this comparison always favors the new plan (since 
> adding an aggregate beneath the join reduces the total number of rows 
> processed by the join and therefore reduces the join cost). Therefore turning 
> on this optimization with the existing cost model is not a good idea.
> One approach to fix this is to localize the cost computation to the rule 
> itself, i.e. compute the non-cumulative cost of the existing aggregate and 
> join and compare it with the cost of the new aggregates, join and top 
> aggregate.
> A better approach, in my opinion, would be to fix the cost model to take 
> aggregate cost into account (along with the join). This could affect other 
> queries and cause performance regressions, but those will most likely be 
> issues with the planning and should be investigated and fixed.
> 2) This optimization currently only supports INNER JOIN. It can be extended 
> to support OUTER joins.
>  
> cc [~jcamachorodriguez] [~ashutoshc] [~gopalv]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
