[ https://issues.apache.org/jira/browse/HIVE-16046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15885458#comment-15885458 ]
Rui Li commented on HIVE-16046:
-------------------------------
Details on why we didn't choose broadcast for map join can be found in HIVE-7613. But I agree we may want to revisit this.

> Broadcasting small table for Hive on Spark
> ------------------------------------------
>
>                 Key: HIVE-16046
>                 URL: https://issues.apache.org/jira/browse/HIVE-16046
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>
> Currently the Spark plan is:
> {code}
> 1. TS(Small table)->Sel/Fil->HashTableSink
>
> 2. TS(Small table)->Sel/Fil->HashTableSink
>
> 3. HashTableDummy --
>                     |
>    HashTableDummy --
>                     |
>    RootTS(Big table)->Sel/Fil->MapJoin-->Sel/Fil->FileSink
> {code}
> 1. Run the small-table SparkWorks on the Spark cluster, which dump the hash maps to files.
> 2. Run the SparkWork for the big table on the Spark cluster. Mappers will look up the small-table hash map from the file using HashTableDummy's loader.
> The disadvantage of the current implementation is that it takes a long time to distribute the hash table via the distributed cache when the hash table is large. Here we want to use sparkContext.broadcast() to store the small table, although this keeps the broadcast variable on the driver and causes some performance decline there.
> [~Fred], [~xuefuz], [~lirui] and [~csun], please give some suggestions on it.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
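The broadcast-based map join proposed above can be sketched as follows. This is a minimal Python simulation of the idea, not actual Hive or Spark code; the names `broadcast_map_join`, `small_table`, `big_table`, and `mapper` are hypothetical. The point it illustrates: the driver builds the small-table hash table once and ships it to every mapper (as `sparkContext.broadcast()` would), instead of each task loading a dumped hash-table file from the distributed cache.

```python
def broadcast_map_join(small_table, big_table, num_partitions=2):
    """Inner-join big_table rows against a broadcast copy of small_table.

    small_table: list of (key, value) pairs -- the small (dimension) table.
    big_table:   list of (key, value) pairs -- the big (fact) table.
    """
    # Driver side: build the hash table once. In real Spark this would be
    # wrapped as: bcast = sparkContext.broadcast(hash_table)
    hash_table = {}
    for key, value in small_table:
        hash_table.setdefault(key, []).append(value)

    # Split the big table into partitions, as Spark would for its mappers.
    partitions = [big_table[i::num_partitions] for i in range(num_partitions)]

    def mapper(partition):
        # Each mapper probes the broadcast hash table locally -- no per-task
        # download of a hash-table dump file from the distributed cache.
        for key, big_value in partition:
            for small_value in hash_table.get(key, []):
                yield (key, big_value, small_value)

    results = []
    for partition in partitions:
        results.extend(mapper(partition))
    return results


small = [(1, "a"), (2, "b")]
big = [(1, "x"), (2, "y"), (3, "z"), (1, "w")]
joined = broadcast_map_join(small, big)
# Keys absent from the small table (key 3) are dropped, as in an inner map join.
```

The trade-off mentioned in the description shows up here too: the full `hash_table` lives in driver memory before being shipped, so a very large small table would pressure the driver.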