Re: Review Request: HIVE-2206: add a new optimizer for query correlation discovery and optimization

Yin Huai Tue, 15 Jan 2013 19:19:20 -0800


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java,
> >  line 275
> > <https://reviews.apache.org/r/7126/diff/11/?file=221827#file221827line275>
> >
> >     Does this imply CLSRSOperator cannot have more than one child operator 
> > in any case. If so, can you please add comments stating that along with 
> > small description why is that?
> 
> Yin Huai wrote:
>     I thought that, in the original plan, a ReduceSinkOperator can only have 
> 1 child. Because a CLSRSOperator replaces a ReduceSinkOperator, it also has a 
> single child. Is my understanding correct?
> 
> Ashutosh Chauhan wrote:
>     I think you are correct. Since its a terminal operator, RSOperator can 
> only have 1 child. It will be good to add this as comment there.


done


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java, line 1451
> > <https://reviews.apache.org/r/7126/diff/11/?file=221831#file221831line1451>
> >
> >     I don't get the reason of introducing rowNumber in Operator. It doesn't 
> > look its required for optimization to take place correctly. Can you 
> > elaborate the need of this variable and different associated methods which 
> > are introduced along with it?

If a query plan is optimized by correlation optimizer, multiple operation paths 
can be merged to a single Map phase. CorrelationCompositeOperator is used to 
evaluate results from different paths and forward a single row to 
ReduceSinkOperator. CorrelationCompositeOperator will first buffer rows 
forwarded to it from different paths. rowNumber is introduced to let 
CorrelationCompositeOperator know when a row has been processed by all paths. 
When a new rowNumber (a new row will come) is set through setRowNumber, 
CorrelationCompositeOperator will evaluate its current buffer and tag which 
operation paths this current row belongs to. I will add comment to explain it.


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java, 
> > line 134
> > <https://reviews.apache.org/r/7126/diff/11/?file=221835#file221835line134>
> >
> >     Can you add comments clarifying whats the reason for this?

done


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java, 
> > line 270
> > <https://reviews.apache.org/r/7126/diff/11/?file=221835#file221835line270>
> >
> >     If I am reading this correctly than, map-side groupbys are converted to 
> > reduce-side groupbys. I don't see any fundamental reason why correlation 
> > optimizer cannot work with map-side groupbys. Is this just the limitation 
> > of current implementation or is there any fundamental reason for it. Please 
> > add comments whatever the case is.

For example, if an MR job for an aggregation operator shares the input table 
with an MR job for a join operator, with Reduce-side aggregation (RS-GBY 
pattern), the ReduceSinkOperator for the merged Map phase only needs to emit a 
single row for both the aggregation operator and the join operator. Thus, I 
decide to convert map-side groupbys to reduce-side groupbys.


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java, line 94
> > <https://reviews.apache.org/r/7126/diff/11/?file=221838#file221838line94>
> >
> >     Can you add comments why it is not compatible with skew-join and 
> > groupby-skew optimizations?

seems skew optimizations will split a single MR job to 2 jobs. I have not 
carefully thought how to apply the correlation optimization on those plans 
optimized by skew optimizations. So, I'd suggest we evaluate this issue in a 
follow-up jira.


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java, line 393
> > <https://reviews.apache.org/r/7126/diff/11/?file=221845#file221845line393>
> >
> >     Explain annotation should have worked, no ? Having this working will be 
> > very useful for debugging.

Enabled. We need to re-generate some test results (probably all results 
involving "EXPLAIN"?).


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java, 
> > line 175
> > <https://reviews.apache.org/r/7126/diff/11/?file=221835#file221835line175>
> >
> >     Seems like the logic of auto conversion of join to map-join is same as 
> > the one used in CommonJoinResolver. If thats the case, instead of 
> > replicating the logic here, it will be good to refactor that logic out of 
> > CommonJoinResolver in some util function in some util class and then use 
> > that function here. That logic will likely change over course of time and 
> > risk getting diverged if we duplicate instead of reuse.

i am working on it.


> On Jan. 10, 2013, 2:24 a.m., Ashutosh Chauhan wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java, 
> > line 342
> > <https://reviews.apache.org/r/7126/diff/11/?file=221835#file221835line342>
> >
> >     Instead of changing the plan first and than undo the changes on the 
> > plan later on (in case optimizer can not  be applied) I am wondering a 
> > safer choice could have been to clone the plan and than do your changes on 
> > the cloned plan. If we find that plan can be optimized, only than change 
> > the original plan. This will be a bit more safer. Thoughts?

Yes, I agree. One question. Is there any convenient way to clone a ParseContext 
or an entire plan tree?


- Yin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/7126/#review15124
-----------------------------------------------------------


On Nov. 19, 2012, 7:51 p.m., Yin Huai wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/7126/
> -----------------------------------------------------------
> 
> (Updated Nov. 19, 2012, 7:51 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Description
> -------
> 
> This optimizer exploits intra-query correlations and merges multiple 
> correlated MapReduce jobs into one jobs. Open a new request since I have been 
> working on hive-git.
> 
> 
> This addresses bug HIVE-2206.
>     https://issues.apache.org/jira/browse/HIVE-2206
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 9fa9525 
>   conf/hive-default.xml.template f332f3a 
>   
> ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
>  7c4c413 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java 
> PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java 
> PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java
>  PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 18a9bd2 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java 46daeb2 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 68302f8 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 0c22141 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java 919a140 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1469325 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java 
> PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java
>  PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java edde378 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java d1555e2 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 2bf284d 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 330aa52 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/BaseReduceSinkDesc.java 
> PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java 
> PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java
>  PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 5a9f064 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java b33d616 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 9a95efd 
>   ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java 6f8bc47 
>   ql/src/test/queries/clientpositive/correlationoptimizer1.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/correlationoptimizer2.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/correlationoptimizer3.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/correlationoptimizer4.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/correlationoptimizer5.q PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer1.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer2.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer3.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer4.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/correlationoptimizer5.q.out PRE-CREATION 
>   ql/src/test/results/compiler/plan/groupby1.q.xml cd0d6e4 
>   ql/src/test/results/compiler/plan/groupby2.q.xml 7b07f02 
>   ql/src/test/results/compiler/plan/groupby3.q.xml a6a1986 
>   ql/src/test/results/compiler/plan/groupby5.q.xml 25e3583 
> 
> Diff: https://reviews.apache.org/r/7126/diff/
> 
> 
> Testing
> -------
> 
> All tests pass.
> 
> 
> Thanks,
> 
> Yin Huai
> 
>

Re: Review Request: HIVE-2206: add a new optimizer for query correlation discovery and optimization

Reply via email to