Re: Question on correlation optimizer

Yin Huai Tue, 10 Dec 2013 19:41:04 -0800

Hi Avrilia,

It is caused by the distinct aggregations in TPC-H Q21. Hive adds the
distinct columns to the key columns of the ReduceSinkOperators, and the
correlation optimizer currently only checks for exactly the same key
columns, so this query is not optimized. The JIRA for this issue is
https://issues.apache.org/jira/browse/HIVE-4751. If you remove distinct
from those aggregation functions, you will see the optimized plan. Another
case that the correlation optimizer does not optimize right now is when a
table is used in multiple MR jobs but its rows are shuffled in different
ways in those jobs.
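
To make the key-column point concrete, here is a rough toy sketch in
Python. It is not Hive's actual code: the helper names
(reduce_sink_keys, exact_key_match) are made up for illustration, and
the column names are just the usual TPC-H lineitem columns. It only
shows why an exact comparison of key columns stops matching once a
distinct column is appended to the keys.

```python
# Toy model only: this is NOT Hive's implementation. It illustrates why an
# exact comparison of ReduceSink key columns fails once a DISTINCT column
# is appended to the keys.

def reduce_sink_keys(group_by_cols, distinct_cols=()):
    """Key columns of a hypothetical ReduceSinkOperator.

    Assumption taken from the explanation above: DISTINCT columns are
    appended to the group-by key columns.
    """
    return tuple(group_by_cols) + tuple(distinct_cols)


def exact_key_match(keys_a, keys_b):
    """Hypothetical stand-in for the 'exact same key columns' check."""
    return keys_a == keys_b


# Two aggregations over the same table, both grouped by l_orderkey.
plain = reduce_sink_keys(["l_orderkey"])
with_distinct = reduce_sink_keys(["l_orderkey"], distinct_cols=["l_suppkey"])

print(exact_key_match(plain, plain))          # identical keys
print(exact_key_match(plain, with_distinct))  # extra key column, no match
```

In this toy model the second comparison fails only because of the extra
key column, which mirrors what happens when distinct is kept in the
aggregation functions.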


Thanks,

Yin


On Tue, Dec 10, 2013 at 8:05 PM, Avrilia Floratou <
avrilia.flora...@gmail.com> wrote:

> Hi,
>
> I'm running TPCH query 21 on Hive 0.12 and have enabled
> hive.optimize.correlation.
> I could see the effect of the correlation optimizer on query 17 but when
> running query 21 I don't actually see the optimizer being used. I used the
> publicly available tpc-h queries for hive and merged all the intermediate
> subqueries into one for Q21. In this query there is a correlation between
> multiple subqueries since they all get lineitem as input. But what I
> observe from the query plan and the execution of the query is that the
> subqueries are executed one by one and their results are materialized
> before the joins among them are executed. Is there any other parameter that
> I need to set to make this work?
>
> Thanks,
> Avrilia
>
