-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28930/
-----------------------------------------------------------
(Updated Dec. 18, 2014, 2:07 a.m.)


Review request for hive.


Changes
-------

Update more golden files. A large number of SMB joins were converted to mapjoin/bucket mapjoin. Also, because of forward-walking, some of the operator numbers changed for the cross-product check.


Bugs: HIVE-8639
    https://issues.apache.org/jira/browse/HIVE-8639


Repository: hive-git


Description
-------

In MapReduce, for auto-SMB joins, SortedMergeJoinProc is run in the earlier Optimizer layer to convert a join to an SMB join, and SortMergeJoinResolver is run in the later PhysicalOptimizer layer to convert it to a MapJoin. For Spark, we have an opportunity to make this cleaner by putting both the SMB and MapJoin conversions in the logical layer and deciding there which one to call.

This patch introduces a new unified join processor called 'SparkJoinOptimizer' in the logical layer. It calls 'SparkMapJoinOptimizer' and 'SparkSortMergeJoinOptimizer' in an order that depends on the flags that are set, and falls back to the other one if the first fails. Thus there is no need to write an SMB -> MapJoin path.

'SparkSortMergeJoinOptimizer' is a new class that wraps the logic of SortedMergeJoinProc, but for Spark. To put both the MapJoin and SMB processors at the same level, I had to make some fixes.

1. One fix is in 'NonBlockingOpDeDupProc', to fix the join context state, as it is now run before the SMB code that relies on it. For this I submitted a trunk patch at HIVE-9060.
2. The second fix is that MapReduce's SMB code did two graph walks: one processor to calculate all 'rejected' joins, and another processor to change the non-rejected ones to SMB joins. Keeping that would have meant doing multiple walks, so I refactored the 'rejected' join logic into the same join-operator visit in SparkSortMergeJoinOptimizer.

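For readers who have not opened the diff, the dispatch idea described above can be sketched roughly as follows. This is a standalone illustration, not the code in the patch: the class JoinDispatchSketch, the JoinCandidate and Flags types, and the mapJoinEnabled/smbEnabled booleans are all hypothetical. Only the smbToMapJoin field is meant to mirror "hive.auto.convert.sortmerge.join.to.mapjoin", and the assumption that this flag puts the MapJoin attempt first is inferred from the Testing notes rather than taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; none of these types are Hive classes. */
class JoinDispatchSketch {

    enum JoinKind { COMMON_JOIN, MAP_JOIN, SMB_JOIN }

    /** Hypothetical summary of whether each conversion would succeed for one join. */
    static final class JoinCandidate {
        final boolean mapJoinApplies; // assumption: e.g. the small side fits in memory
        final boolean smbApplies;     // assumption: e.g. inputs are bucketed and sorted on the join keys

        JoinCandidate(boolean mapJoinApplies, boolean smbApplies) {
            this.mapJoinApplies = mapJoinApplies;
            this.smbApplies = smbApplies;
        }
    }

    /** Hypothetical flag holder; only smbToMapJoin corresponds to a property named in this review. */
    static final class Flags {
        final boolean mapJoinEnabled;
        final boolean smbEnabled;
        final boolean smbToMapJoin; // stands in for hive.auto.convert.sortmerge.join.to.mapjoin

        Flags(boolean mapJoinEnabled, boolean smbEnabled, boolean smbToMapJoin) {
            this.mapJoinEnabled = mapJoinEnabled;
            this.smbEnabled = smbEnabled;
            this.smbToMapJoin = smbToMapJoin;
        }
    }

    /** Decide which conversions to attempt, and in what order, from the flags alone. */
    static List<JoinKind> attemptOrder(Flags flags) {
        List<JoinKind> order = new ArrayList<>();
        if (flags.mapJoinEnabled && flags.smbEnabled) {
            if (flags.smbToMapJoin) {
                order.add(JoinKind.MAP_JOIN);
                order.add(JoinKind.SMB_JOIN);
            } else {
                order.add(JoinKind.SMB_JOIN);
                order.add(JoinKind.MAP_JOIN);
            }
        } else if (flags.mapJoinEnabled) {
            order.add(JoinKind.MAP_JOIN);
        } else if (flags.smbEnabled) {
            order.add(JoinKind.SMB_JOIN);
        }
        return order;
    }

    /** Try each conversion in order within a single visit; keep the first that succeeds. */
    static JoinKind chooseJoin(JoinCandidate join, Flags flags) {
        for (JoinKind kind : attemptOrder(flags)) {
            boolean converted = (kind == JoinKind.MAP_JOIN) ? join.mapJoinApplies : join.smbApplies;
            if (converted) {
                return kind;
            }
        }
        return JoinKind.COMMON_JOIN; // nothing applied, so the join is left unchanged
    }

    public static void main(String[] args) {
        JoinCandidate both = new JoinCandidate(true, true);
        JoinCandidate smbOnly = new JoinCandidate(false, true);
        System.out.println(chooseJoin(both, new Flags(true, true, true)));    // MAP_JOIN
        System.out.println(chooseJoin(both, new Flags(true, true, false)));   // SMB_JOIN
        System.out.println(chooseJoin(smbOnly, new Flags(true, true, true))); // SMB_JOIN
    }
}
```

Because the fallback happens inside one visit of the join operator, a join that fails the first conversion is offered to the second immediately, which is why no separate SMB -> MapJoin pass is needed in this sketch.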

Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java c2e643d
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkJoinOptimizer.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java 680c6fd
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkReduceSinkMapJoinProc.java 83625ef
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkSortMergeJoinOptimizer.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java 5e432ac
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java b6a7ac2
  ql/src/test/results/clientpositive/spark/auto_join32.q.out 28c022e
  ql/src/test/results/clientpositive/spark/auto_join_stats.q.out bccd246
  ql/src/test/results/clientpositive/spark/auto_smb_mapjoin_14.q.out 842b4b3
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_1.q.out 2e35c66
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_12.q.out ee37010
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_13.q.out b2e928f
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_14.q.out 20ee657
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_15.q.out 0a48d00
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_2.q.out 5008a3f
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_3.q.out 3b081af
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_4.q.out 2a11fb2
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_5.q.out 0d971d2
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_6.q.out 9d455dc
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_7.q.out 61eb6ae
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_8.q.out 198d50d
  ql/src/test/results/clientpositive/spark/auto_sortmerge_join_9.q.out f59e57f
  ql/src/test/results/clientpositive/spark/bucketsortoptimize_insert_2.q.out b58091c
  ql/src/test/results/clientpositive/spark/bucketsortoptimize_insert_4.q.out 8ee392e
  ql/src/test/results/clientpositive/spark/bucketsortoptimize_insert_6.q.out 9c119df
  ql/src/test/results/clientpositive/spark/bucketsortoptimize_insert_7.q.out b9ad92d
  ql/src/test/results/clientpositive/spark/bucketsortoptimize_insert_8.q.out ed4d03f
  ql/src/test/results/clientpositive/spark/cross_product_check_2.q.out 6fb69a5
  ql/src/test/results/clientpositive/spark/parquet_join.q.out 240989a
  ql/src/test/results/clientpositive/spark/smb_mapjoin_17.q.out 268ae23
  ql/src/test/results/clientpositive/spark/smb_mapjoin_25.q.out df66cc2
  ql/src/test/results/clientpositive/spark/subquery_multiinsert.q.out f635949

Diff: https://reviews.apache.org/r/28930/diff/


Testing
-------

Most of the auto-smb tests give the same output with this change. The only difference is that some SMB joins now become MapJoins if "hive.auto.convert.sortmerge.join.to.mapjoin" is on, as expected.

One failing test is auto_sortmerge_join_9. It was passing until yesterday, when bucket map join was enabled in HIVE-8638. As expected, when MapJoin is chosen over SMB join because "hive.auto.convert.sortmerge.join.to.mapjoin" is on, the MapJoin may become a bucket mapjoin. Some of the more complicated queries in auto_sortmerge_join_9 get converted to bucket mapjoin and fail. A new JIRA can probably be filed to fix this test.


Thanks,

Szehon Ho