[ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Phabricator updated HIVE-2206:
------------------------------

    Attachment: HIVE-2206.D11097.4.patch

yhuai updated the revision "HIVE-2206 [jira] add a new optimizer for query correlation discovery and optimization".

  address brock's comments

Reviewers: JIRA

REVISION DETAIL
  https://reviews.facebook.net/D11097

CHANGE SINCE LAST DIFF
  https://reviews.facebook.net/D11097?vs=34383&id=34401#toc

AFFECTED FILES
  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
  conf/hive-default.xml.template
  ql/if/queryplan.thrift
  ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
  ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
  ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java
  ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
  ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
  ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
  ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
  ql/src/java/org/apache/hadoop/hive/ql/plan/DemuxDesc.java
  ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java
  ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
  ql/src/test/queries/clientpositive/correlationoptimizer1.q
  ql/src/test/queries/clientpositive/correlationoptimizer2.q
  ql/src/test/queries/clientpositive/correlationoptimizer3.q
  ql/src/test/queries/clientpositive/correlationoptimizer4.q
  ql/src/test/queries/clientpositive/correlationoptimizer5.q
  ql/src/test/results/clientpositive/correlationoptimizer1.q.out
  ql/src/test/results/clientpositive/correlationoptimizer2.q.out
  ql/src/test/results/clientpositive/correlationoptimizer3.q.out
  ql/src/test/results/clientpositive/correlationoptimizer4.q.out
  ql/src/test/results/clientpositive/correlationoptimizer5.q.out
  ql/src/test/results/compiler/plan/groupby2.q.xml
  ql/src/test/results/compiler/plan/groupby3.q.xml

To: JIRA, yhuai
Cc: brock

> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.12.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>         Attachments: HIVE-2206.10-r1384442.patch.txt,
> HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt,
> HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt,
> HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt,
> HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt,
> HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt,
> HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt,
> HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt,
> HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt,
> HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt,
> HIVE-2206.D11097.1.patch, HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch,
> HIVE-2206.D11097.4.patch, testQueries.2.q, YSmartPatchForHive.patch
>
>
> This issue proposes a new logical
> optimizer called Correlation Optimizer, which merges correlated MapReduce
> jobs (MR jobs) into a single MR job. The idea is based on YSmart
> (http://ysmart.cse.ohio-state.edu/). The paper and slides of YSmart are
> linked at the bottom.
> Since Hive translates queries in a sentence-by-sentence fashion, it
> generates a MapReduce job for every operation that may need to shuffle the
> data (e.g. join and aggregation operations). However, such operations may
> involve the correlations explained below and can then be executed in a
> single MR job.
> # Input Correlation: Multiple MR jobs have input correlation (IC) if their
> input relation sets are not disjoint;
> # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they
> have not only input correlation, but also the same partition key;
> # Job Flow Correlation: An MR job has job flow correlation (JFC) with one of
> its child nodes if it has the same partition key as that child node.
> The current implementation of the correlation optimizer only detects
> correlations among MR jobs for reduce-side join operators and reduce-side
> aggregation operators (not map-only aggregation). A query will be optimized
> if it satisfies the following conditions.
> # There exists an MR job for a reduce-side join operator or reduce-side
> aggregation operator that has JFC with all of its parent MR jobs (TCs will
> also be exploited if JFC exists);
> # All input tables of those correlated MR jobs are original input tables (not
> intermediate tables generated by sub-queries); and
> # No self join is involved in those correlated MR jobs.
> The correlation optimizer is implemented as a logical optimizer. The main
> reasons are that it only needs to manipulate the query plan tree and it can
> leverage the existing component for generating MR jobs.
> The current implementation can serve as a framework for correlation-related
> optimizations.
> I think that it is better than adding individual optimizers.
> Several pieces of work can be done in the future to improve this optimizer.
> Here are three examples.
> # Support queries that involve only TC;
> # Support queries in which the input tables of correlated MR jobs include
> intermediate tables; and
> # Optimize queries involving self joins.
> References:
> Paper and presentation of YSmart.
> Paper:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
> Slides: http://sdrv.ms/UpwJJc
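
For readers new to the terminology, the three correlation definitions in the issue description (IC, TC, JFC) can be read as simple predicates over a toy job model. The sketch below is only an illustration under stated assumptions: MRJob and the has_* functions are hypothetical names, not classes or methods from the patch, and a partition key is reduced to a plain tuple of column names, which is much simpler than how Hive represents shuffle keys.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple


@dataclass
class MRJob:
    """Hypothetical stand-in for one MapReduce job in a query plan."""
    name: str
    input_tables: FrozenSet[str]      # base tables the job reads directly
    partition_key: Tuple[str, ...]    # columns the job shuffles on (simplified)
    children: List["MRJob"] = field(default_factory=list)  # jobs feeding this one


def has_input_correlation(a: MRJob, b: MRJob) -> bool:
    # IC: the input relation sets of the two jobs are not disjoint.
    return not a.input_tables.isdisjoint(b.input_tables)


def has_transit_correlation(a: MRJob, b: MRJob) -> bool:
    # TC: input correlation plus the same partition key.
    return has_input_correlation(a, b) and a.partition_key == b.partition_key


def has_job_flow_correlation(job: MRJob, child: MRJob) -> bool:
    # JFC: the job has the same partition key as one of its child nodes.
    is_child = any(c is child for c in job.children)
    return is_child and job.partition_key == child.partition_key


# Toy plan: two aggregations that both read t1 and shuffle on "key",
# feeding a join that also shuffles on "key".
agg1 = MRJob("agg1", frozenset({"t1"}), ("key",))
agg2 = MRJob("agg2", frozenset({"t1", "t2"}), ("key",))
join = MRJob("join", frozenset(), ("key",), [agg1, agg2])

print(has_input_correlation(agg1, agg2))     # agg1 and agg2 share t1
print(has_transit_correlation(agg1, agg2))   # and the same partition key
print(has_job_flow_correlation(join, agg1))  # join shuffles on agg1's key
```

In this toy plan all three jobs shuffle on the same key, which is the kind of situation the description says can be collapsed into a single MR job. How the optimizer actually discovers this in the operator tree is not covered by the sketch.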