----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28930/ -----------------------------------------------------------
(Updated Dec. 17, 2014, 11:56 p.m.) Review request for hive. Changes ------- Address review comments, update some golden files, and fix another issue. The issue is that if SMBJoin and MapJoin operators are in the same tree, they trigger some code in SparkReduceSinkMapJoinProc and GenSparkWork that corrupts the graph. In particular, SparkReduceSinkMapJoinProc assumes that you only visit a MapJoin op once from a non-RS path (big-table), but this becomes false if the big-table is a child of SMBJoin, as that itself has multiple non-RS parents. The additional fix is to make sure we walk down only once from SMBJoinOp, from only the big-table path of the SMBJoin. Thus we skip further walking if it's a small-table, as anyway no further processing is necessary. There was some code in GenSparkWork that anyway skipped processing for small-table paths of the SMBJoin, now its unnecessary and removed. Also, removed some unnecessary code in SparkReduceSinkMapJoinProc that was triggered to corrupt the graph. Bugs: HIVE-8639 https://issues.apache.org/jira/browse/HIVE-8639 Repository: hive-git Description ------- In MapReduce for auto-SMB joins, SortedMergeJoinProc is run in the earlier Optimizer layer to convert join to SMB join, and SortMergeJoinResolver is run in later PhysicalOptimizer layer to convert it to MapJoin. For Spark, we have an opportunity to make it cleaner by deciding putting both SMB and MapJoin conversions in the logical layer and deciding which one to call. This patch introduces a new unitied join processor called 'SparkJoinOptimizer' in the logical layer. This will call 'SparkMapJoinOptimizer' and 'SparkSortMergeJoinOptimizer' in a certain order depending on the flags that are set and which ever one is available fails. Thus no need to write a SMB -> MapJoin path. 'SparkSortMergeJoinOptimizer' is a new class that wraps the logic of SortedMergeJoinProc but for Spark. To put both MapJoin/SMB processor in the same level, I had to do some fixes. 1. One fix is in 'NonBlockingOpDeDupProc', to fix the join context state, as now its run before the SMB code that relies on it. For this I submitted a trunk patch at HIVE-9060. 2. The second fix is that MapReduce's SMB code did two graph walks, one processor to calculate all 'rejected' joins, and another processor to change the non-rejected ones to SMB join. That would have made it so we do multiple walks, so I refactored the 'rejected' join logic in the same join-operator visit in SparkSortMergeJoinOptimizer. Diffs (updated) ----- ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java c2e643d ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkJoinOptimizer.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java 680c6fd ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkReduceSinkMapJoinProc.java 83625ef ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkSortMergeJoinOptimizer.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java 5e432ac ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java b6a7ac2 ql/src/test/results/clientpositive/spark/auto_join32.q.out 28c022e ql/src/test/results/clientpositive/spark/auto_join_stats.q.out bccd246 ql/src/test/results/clientpositive/spark/auto_sortmerge_join_1.q.out 2e35c66 ql/src/test/results/clientpositive/spark/auto_sortmerge_join_13.q.out b2e928f ql/src/test/results/clientpositive/spark/auto_sortmerge_join_14.q.out 20ee657 ql/src/test/results/clientpositive/spark/auto_sortmerge_join_15.q.out 0a48d00 ql/src/test/results/clientpositive/spark/auto_sortmerge_join_2.q.out 5008a3f ql/src/test/results/clientpositive/spark/auto_sortmerge_join_3.q.out 3b081af ql/src/test/results/clientpositive/spark/auto_sortmerge_join_4.q.out 2a11fb2 ql/src/test/results/clientpositive/spark/auto_sortmerge_join_5.q.out 0d971d2 ql/src/test/results/clientpositive/spark/auto_sortmerge_join_6.q.out 9d455dc ql/src/test/results/clientpositive/spark/auto_sortmerge_join_7.q.out 61eb6ae ql/src/test/results/clientpositive/spark/auto_sortmerge_join_8.q.out 198d50d ql/src/test/results/clientpositive/spark/auto_sortmerge_join_9.q.out f59e57f ql/src/test/results/clientpositive/spark/parquet_join.q.out 240989a ql/src/test/results/clientpositive/spark/smb_mapjoin_17.q.out 268ae23 ql/src/test/results/clientpositive/spark/smb_mapjoin_25.q.out df66cc2 Diff: https://reviews.apache.org/r/28930/diff/ Testing ------- Most of the auto-smb tests give the same output with this change, the only difference is now some SMB joins become MapJoin if "hive.auto.convert.sortmerge.join.to.mapjoin" is on, as expected. One failing test is auto_sortmerge_join_9. This was passing until yesterday when bucket-map join is enabled in HIVE-8638. As expected, by choosing MapJoins over SMB join if "hive.auto.convert.sortmerge.join.to.mapjoin" is on, the MapJoin may become a bucket-mapjoin. Some of the more complicated queries of auto_sortmerge_join_9 get converted to bucket mapjoin and fail. Can probably file a new JIRA to fix this test. Thanks, Szehon Ho