-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28930/#review65528
-----------------------------------------------------------

Ship it!

Ship It!

- Xuefu Zhang


On Dec. 18, 2014, 2:07 a.m., Szehon Ho wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/28930/
> -----------------------------------------------------------
> 
> (Updated Dec. 18, 2014, 2:07 a.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-8639
>     https://issues.apache.org/jira/browse/HIVE-8639
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> In MapReduce, for auto-SMB joins, SortedMergeJoinProc runs in the earlier
> Optimizer layer to convert the join to an SMB join, and SortMergeJoinResolver
> runs in the later PhysicalOptimizer layer to convert it to a MapJoin. For
> Spark, we have an opportunity to make this cleaner by putting both the SMB
> and MapJoin conversions in the logical layer and deciding there which one to
> call.
> 
> This patch introduces a new unified join processor called
> 'SparkJoinOptimizer' in the logical layer. It calls 'SparkMapJoinOptimizer'
> and 'SparkSortMergeJoinOptimizer' in an order that depends on the flags that
> are set, falling back to the other one if the first fails. Thus there is no
> need to write an SMB -> MapJoin path.
> 
> 'SparkSortMergeJoinOptimizer' is a new class that wraps the logic of
> SortedMergeJoinProc, but for Spark. To put both the MapJoin and SMB
> processors at the same level, I had to make some fixes.
> 
> 1. One fix is in 'NonBlockingOpDeDupProc', to fix the join context state, as
> it now runs before the SMB code that relies on it. For this I submitted a
> trunk patch at HIVE-9060.
> 2. The second fix is that MapReduce's SMB code did two graph walks: one
> processor to calculate all 'rejected' joins, and another processor to change
> the non-rejected ones to SMB joins. Keeping that would have meant multiple
> walks, so I refactored the 'rejected' join logic into the same join-operator
> visit in SparkSortMergeJoinOptimizer.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java c2e643d
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkJoinOptimizer.java PRE-CREATION
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java 680c6fd
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkReduceSinkMapJoinProc.java 83625ef
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkSortMergeJoinOptimizer.java PRE-CREATION
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java 5e432ac
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java b6a7ac2
>   ql/src/test/results/clientpositive/spark/auto_join32.q.out 28c022e
>   ql/src/test/results/clientpositive/spark/auto_join_stats.q.out bccd246
>   ql/src/test/results/clientpositive/spark/auto_smb_mapjoin_14.q.out 842b4b3
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_1.q.out 2e35c66
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_12.q.out ee37010
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_13.q.out b2e928f
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_14.q.out 20ee657
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_15.q.out 0a48d00
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_2.q.out 5008a3f
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_3.q.out 3b081af
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_4.q.out 2a11fb2
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_5.q.out 0d971d2
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_6.q.out 9d455dc
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_7.q.out 61eb6ae
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_8.q.out 198d50d
>   ql/src/test/results/clientpositive/spark/auto_sortmerge_join_9.q.out f59e57f
>   ql/src/test/results/clientpositive/spark/bucketsortoptimize_insert_2.q.out b58091c
>   ql/src/test/results/clientpositive/spark/bucketsortoptimize_insert_4.q.out 8ee392e
>   ql/src/test/results/clientpositive/spark/bucketsortoptimize_insert_6.q.out 9c119df
>   ql/src/test/results/clientpositive/spark/bucketsortoptimize_insert_7.q.out b9ad92d
>   ql/src/test/results/clientpositive/spark/bucketsortoptimize_insert_8.q.out ed4d03f
>   ql/src/test/results/clientpositive/spark/cross_product_check_2.q.out 6fb69a5
>   ql/src/test/results/clientpositive/spark/parquet_join.q.out 240989a
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_17.q.out 268ae23
>   ql/src/test/results/clientpositive/spark/smb_mapjoin_25.q.out df66cc2
>   ql/src/test/results/clientpositive/spark/subquery_multiinsert.q.out f635949
> 
> Diff: https://reviews.apache.org/r/28930/diff/
> 
> 
> Testing
> -------
> 
> Most of the auto-smb tests give the same output with this change. The only
> difference is that some SMB joins now become MapJoins if
> "hive.auto.convert.sortmerge.join.to.mapjoin" is on, as expected.
> 
> One failing test is auto_sortmerge_join_9. It was passing until yesterday,
> when bucket-map join was enabled in HIVE-8638. As expected, because MapJoins
> are chosen over SMB joins when "hive.auto.convert.sortmerge.join.to.mapjoin"
> is on, the MapJoin may become a bucket mapjoin. Some of the more complicated
> queries of auto_sortmerge_join_9 get converted to bucket mapjoin and fail.
> We can probably file a new JIRA to fix this test.
> 
> 
> Thanks,
> 
> Szehon Ho
> 
>
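
For anyone skimming this thread, the ordering described in the quoted Description and Testing sections might be sketched roughly as below. This is only an illustration and is not code from the patch: the class JoinDispatchSketch, the method chooseJoin, the JoinKind enum and the three boolean parameters are all hypothetical names. It assumes that "hive.auto.convert.sortmerge.join.to.mapjoin" decides which conversion is tried first, that the other conversion is the fallback, and that the join is left as an ordinary join (COMMON_JOIN here) when neither applies.

```java
public class JoinDispatchSketch {

    /** Possible outcomes for one join operator in this sketch (hypothetical names). */
    public enum JoinKind { MAP_JOIN, SMB_JOIN, COMMON_JOIN }

    /**
     * Picks a join strategy in a single visit of the join operator.
     *
     * @param preferMapJoin  stands in for hive.auto.convert.sortmerge.join.to.mapjoin
     * @param mapJoinApplies whether the MapJoin conversion would succeed (assumed input)
     * @param smbJoinApplies whether the SMB conversion would succeed (assumed input)
     */
    public static JoinKind chooseJoin(boolean preferMapJoin,
                                      boolean mapJoinApplies,
                                      boolean smbJoinApplies) {
        if (preferMapJoin) {
            // Flag on: try MapJoin first, fall back to SMB join.
            if (mapJoinApplies) return JoinKind.MAP_JOIN;
            if (smbJoinApplies) return JoinKind.SMB_JOIN;
        } else {
            // Flag off: try SMB join first, fall back to MapJoin.
            if (smbJoinApplies) return JoinKind.SMB_JOIN;
            if (mapJoinApplies) return JoinKind.MAP_JOIN;
        }
        // Neither conversion applies: assume the join is left unchanged.
        return JoinKind.COMMON_JOIN;
    }

    public static void main(String[] args) {
        // Both conversions possible, flag on vs. off.
        System.out.println(chooseJoin(true, true, true));
        System.out.println(chooseJoin(false, true, true));
        // Neither conversion possible.
        System.out.println(chooseJoin(false, false, false));
    }
}
```

Because each branch tries one conversion and then the other within the same visit, no separate SMB -> MapJoin path is needed in this sketch, which is the property the Description points out.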