[ https://issues.apache.org/jira/browse/HIVE-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Szehon Ho updated HIVE-8639:
----------------------------
    Attachment: HIVE-8639.2-spark.patch

Address review comments, update some golden files, and fix another issue.

The issue is that if SMBJoin and MapJoin operators are in the same tree, they trigger code in SparkReduceSinkMapJoinProc and GenSparkWork that corrupts the graph. In particular, those processors assumed that a MapJoin op is visited only once from a non-RS (big-table) path, but this no longer holds if the big table is a child of an SMBJoin, since the SMBJoin itself has multiple non-RS parents.

The additional fix is to make sure we walk down from the SMBJoinOp only once, along the big-table path. We skip further walking when coming from a small-table path, as no further processing is necessary there anyway.

RB is not working for me at the moment; I will upload the patch there once it is.

> Convert SMBJoin to MapJoin [Spark Branch]
> -----------------------------------------
>
>                 Key: HIVE-8639
>                 URL: https://issues.apache.org/jira/browse/HIVE-8639
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>    Affects Versions: spark-branch
>            Reporter: Szehon Ho
>            Assignee: Szehon Ho
>         Attachments: HIVE-8639.1-spark.patch, HIVE-8639.2-spark.patch
>
>
> HIVE-8202 supports auto-conversion of SMB Join. However, if the tables are
> partitioned, there could be a slowdown, as each mapper would need to get a
> very small chunk of a partition which has a single key. Thus, in some
> scenarios it's beneficial to convert SMB join to map join.
> The task is to research and support the conversion from SMB join to map join
> for the Spark execution engine. See the MapReduce equivalent in
> SortMergeJoinResolver.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)