[ https://issues.apache.org/jira/browse/HIVE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138560#comment-14138560 ]
Xuefu Zhang commented on HIVE-7613:
-----------------------------------

Here is what I have in mind:

1. For an N-way join being converted to a map join, we can run N-1 Spark jobs, one for each small input to the join (assuming a transformation is needed; if not, no Spark job is required). Each job produces an RDD, so we end up with N-1 RDDs.
2. Dump the contents of those RDDs into the data structure (hash tables) needed by MapJoinOperator.
3. Call SparkContext.broadcast() on that data structure, which broadcasts it to all nodes.
4. Then launch the map-only join job, which loads the broadcasted data structure via the HashTableLoader interface.

For more information about Spark's broadcast variables, please refer to http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables.

> Research optimization of auto convert join to map join [Spark branch]
> ---------------------------------------------------------------------
>
>                 Key: HIVE-7613
>                 URL: https://issues.apache.org/jira/browse/HIVE-7613
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Chengxiang Li
>            Assignee: Suhas Satish
>            Priority: Minor
>         Attachments: HIve on Spark Map join background.docx
>
>
> ConvertJoinMapJoin is an optimization that replaces a common join (aka shuffle join) with a map join (aka broadcast or fragment replicate join) when possible. We need to research how to make it workable with Hive on Spark.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
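The four steps in the comment above can be sketched in plain Java. This is only an illustration of the map-join pattern, not Hive code: the class and method names are hypothetical, and an ordinary in-memory HashMap stands in for the structure that the real implementation would hand to SparkContext.broadcast() and later load through the HashTableLoader interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the map-join flow: build a hash table from the
// small input (steps 1-2), treat it as the broadcast value (step 3), and
// probe it while streaming the big table in a map-only pass (step 4).
public class MapJoinSketch {

    // Steps 1-2: dump the small input into a hash table (join key -> payload).
    static Map<String, String> buildHashTable(List<String[]> smallTable) {
        Map<String, String> hashTable = new HashMap<>();
        for (String[] row : smallTable) {
            hashTable.put(row[0], row[1]);
        }
        return hashTable;
    }

    // Step 4: map-only join; each "mapper" probes the broadcast hash table
    // for every big-table row and emits matches (inner-join semantics).
    static List<String> mapJoin(List<String[]> bigTable,
                                Map<String, String> broadcastHashTable) {
        List<String> out = new ArrayList<>();
        for (String[] row : bigTable) {
            String match = broadcastHashTable.get(row[0]);
            if (match != null) {
                out.add(row[0] + "," + row[1] + "," + match);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> ht = buildHashTable(Arrays.asList(
                new String[]{"1", "us"}, new String[]{"2", "uk"}));
        List<String> joined = mapJoin(Arrays.asList(
                new String[]{"1", "alice"}, new String[]{"3", "bob"}), ht);
        System.out.println(joined); // only key "1" matches the small table
    }
}
```

The key point of the design is that only the small inputs are materialized and shipped to every node; the big table is never shuffled, which is what makes the join map-only.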