Hi Avrilia,

This is caused by the distinct aggregations in TPC-H Q21. Hive adds the distinct columns to the key columns of the ReduceSinkOperators, and the correlation optimizer currently only checks for exactly matching key columns, so this query is not optimized. The JIRA for this issue is https://issues.apache.org/jira/browse/HIVE-4751. If you remove distinct from those aggregation functions, you will see the optimized plan.

Another case that the correlation optimizer does not handle right now is when a table is used in multiple MR jobs but the rows of this table are shuffled in different ways.
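
To make the key-column check above concrete, here is a toy Python model of it. This is only a sketch of the idea, not Hive's actual code: the `ReduceSink` class, the helper functions, and the choice of `l_orderkey` / `l_suppkey` as example columns are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ReduceSink:
    """Hypothetical stand-in for a ReduceSinkOperator; only key columns are modeled."""
    name: str
    key_columns: Tuple[str, ...]


def reduce_sink_for_group_by(name: str,
                             group_by_cols: Sequence[str],
                             distinct_cols: Sequence[str] = ()) -> ReduceSink:
    # As described above: distinct columns are appended to the key columns.
    return ReduceSink(name, tuple(group_by_cols) + tuple(distinct_cols))


def has_same_keys(a: ReduceSink, b: ReduceSink) -> bool:
    # The check modeled here only accepts exactly identical key columns.
    return a.key_columns == b.key_columns


# A join on l_orderkey, and an aggregation grouped by l_orderkey (example columns).
join_rs = ReduceSink("join", ("l_orderkey",))
agg_with_distinct = reduce_sink_for_group_by("agg", ["l_orderkey"], ["l_suppkey"])
agg_without_distinct = reduce_sink_for_group_by("agg", ["l_orderkey"])

# The same table shuffled on a different key in another job.
other_shuffle = ReduceSink("other", ("l_suppkey",))

print(has_same_keys(join_rs, agg_with_distinct))     # False: distinct column widens the key
print(has_same_keys(join_rs, agg_without_distinct))  # True: keys match exactly
print(has_same_keys(join_rs, other_shuffle))         # False: different shuffle key
```

In this model, the distinct aggregation fails the check only because its key is `(l_orderkey, l_suppkey)` instead of `(l_orderkey,)`, which mirrors why removing distinct lets the plan be optimized.
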

Thanks,
Yin

On Tue, Dec 10, 2013 at 8:05 PM, Avrilia Floratou <avrilia.flora...@gmail.com> wrote:

> Hi,
>
> I'm running TPC-H query 21 on Hive 0.12 and have enabled
> hive.optimize.correlation.
> I could see the effect of the correlation optimizer on query 17 but when
> running query 21 I don't actually see the optimizer being used. I used the
> publicly available tpc-h queries for hive and merged all the intermediate
> subqueries into one for Q21. In this query there is a correlation between
> multiple subqueries since they all get lineitem as input. But what I
> observe from the query plan and the execution of the query is that the
> subqueries are executed one by one and their results are materialized
> before the joins among them are executed. Is there any other parameter that
> I need to set to make this work?
>
> Thanks,
> Avrilia