[ 
https://issues.apache.org/jira/browse/HIVE-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szehon Ho updated HIVE-8639:
----------------------------
    Attachment: HIVE-8639.2-spark.patch

Address review comments, update some golden files, and fix another issue.  The 
issue is that if SMBJoin and MapJoin operators are in the same tree, they 
trigger some code in SparkReduceSinkMapJoinProc and GenSparkWork that corrupts 
the graph.  In particular, those processors assumed that a MapJoin op is 
visited only once from a non-RS path (the big-table path), but this no longer 
holds if the big-table is a child of SMBJoin, as that itself has multiple 
non-RS parents.

The additional fix is to make sure we walk down from SMBJoinOp only once, 
along the big-table path.  We skip further walking when we arrive from a 
small-table, since no further processing is necessary there anyway.
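
To illustrate the double visit and the skip, here is a rough, self-contained 
sketch.  It is not the code in the patch: Op, walk, bigTableParent and 
countMapJoinVisits are hypothetical names invented for this toy model, and it 
does not use Hive's operator or processor classes.  It builds two table scans 
feeding an SMBJoin whose output is the big-table input of a MapJoin, then 
counts how often the MapJoin is reached with and without the skip.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SmbWalkSketch {
    // Hypothetical toy operator; this is not Hive's Operator class.
    static final class Op {
        final String name;
        final List<Op> children = new ArrayList<>();
        Op bigTableParent; // set only on the toy SMB join
        Op(String name) { this.name = name; }
    }

    static void link(Op parent, Op child) { parent.children.add(child); }

    // Depth-first walk that counts how many times each op is reached.
    static void walk(Op parent, Op op, boolean skipSmallTable,
                     Map<String, Integer> visits) {
        // Assumed fix: when the SMB join is reached from a small-table
        // parent, stop here instead of walking further down.
        if (skipSmallTable && op.bigTableParent != null
                && parent != op.bigTableParent) {
            return;
        }
        visits.merge(op.name, 1, Integer::sum);
        for (Op child : op.children) {
            walk(op, child, skipSmallTable, visits);
        }
    }

    static int countMapJoinVisits(boolean skipSmallTable) {
        Op tsBig = new Op("TS_big");
        Op tsSmall = new Op("TS_small");
        Op smb = new Op("SMBJoin");
        Op mapJoin = new Op("MapJoin");
        link(tsBig, smb);     // big-table parent of the SMB join (non-RS)
        link(tsSmall, smb);   // small-table parent of the SMB join (non-RS)
        link(smb, mapJoin);   // SMB join output is the MapJoin big-table
        smb.bigTableParent = tsBig;

        Map<String, Integer> visits = new HashMap<>();
        walk(null, tsBig, skipSmallTable, visits);
        walk(null, tsSmall, skipSmallTable, visits);
        return visits.getOrDefault("MapJoin", 0);
    }

    public static void main(String[] args) {
        System.out.println("naive walk: " + countMapJoinVisits(false));
        System.out.println("big-table-only walk: " + countMapJoinVisits(true));
    }
}
```

In this toy graph the naive walk reaches the MapJoin twice, once per SMBJoin 
parent, while the big-table-only walk reaches it once, which is the 
assumption the processors rely on.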

RB is not working for me at the moment, will upload there once it is.

> Convert SMBJoin to MapJoin [Spark Branch]
> -----------------------------------------
>
>                 Key: HIVE-8639
>                 URL: https://issues.apache.org/jira/browse/HIVE-8639
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>    Affects Versions: spark-branch
>            Reporter: Szehon Ho
>            Assignee: Szehon Ho
>         Attachments: HIVE-8639.1-spark.patch, HIVE-8639.2-spark.patch
>
>
> HIVE-8202 supports auto-conversion of SMB Join.  However, if the tables are 
> partitioned, there could be a slowdown, as each mapper would need to get a 
> very small chunk of a partition which has a single key. Thus, in some 
> scenarios it's beneficial to convert SMB join to map join.
> The task is to research and support the conversion from SMB join to map join 
> for the Spark execution engine.  See the MapReduce equivalent in 
> SortMergeJoinResolver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
