[ https://issues.apache.org/jira/browse/HIVE-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364718#comment-14364718 ]
Xin Hao commented on HIVE-9697:
-------------------------------

Could we consider using rawDataSize by default (it should be safer for most scenarios), and adding a true/false Hive parameter so that users can choose totalSize on demand?

> Hive on Spark is not as aggressive as MR on map join [Spark Branch]
> -------------------------------------------------------------------
>
>                 Key: HIVE-9697
>                 URL: https://issues.apache.org/jira/browse/HIVE-9697
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xin Hao
>
> We found the following while running some Big-Bench cases:
> when the same small-table size threshold is used, the Map Join operator is not
> generated in the stage plans for Hive on Spark, but it is generated for
> Hive on MR.
> For example, when we run BigBench Q25, the metadata of one input ORC table
> is as follows:
> totalSize=1748955 (about 1.7M)
> rawDataSize=123050375 (about 120M)
> With the following parameter settings,
> set hive.auto.convert.join=true;
> set hive.mapjoin.smalltable.filesize=25000000;
> set hive.auto.convert.join.noconditionaltask=true;
> set hive.auto.convert.join.noconditionaltask.size=100000000; (100M)
> Map Join is enabled for Hive on MR but not for Hive on Spark.
> We found that Hive on MR compares the HDFS file size of the table
> (ContentSummary.getLength(), which should approximate the value of 'totalSize')
> with the 100M threshold (it is smaller than 100M), while
> Hive on Spark compares 'rawDataSize' with the 100M threshold
> (it is larger than 100M). That is why MapJoin is not enabled for Hive on Spark
> in this case, and as a result Hive on Spark performs much worse
> than Hive on MR here.
> When we set hive.auto.convert.join.noconditionaltask.size=150000000; (150M),
> MapJoin is enabled for Hive on Spark, and its performance becomes
> similar to that of Hive on MR.
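
To make the comparison in the description concrete, here is a minimal Python sketch of the decision as described there. It is not Hive's actual planner code: the function name `mapjoin_eligible` and the `use_total_size` switch are hypothetical (the switch only stands in for the true/false parameter proposed in the comment), and the simple "size below threshold" test is an assumption that simplifies what the real optimizer does.

```python
# Table statistics for the BigBench Q25 input table quoted in the issue.
Q25_TABLE_STATS = {
    "totalSize": 1_748_955,      # on-disk (compressed ORC) size in bytes
    "rawDataSize": 123_050_375,  # uncompressed data size in bytes
}


def mapjoin_eligible(stats, threshold, use_total_size):
    """Return True if the table counts as small enough for a map join.

    `use_total_size` is a hypothetical switch: True mimics the behaviour
    described for Hive on MR (file size on HDFS), False mimics the
    behaviour described for Hive on Spark (rawDataSize).
    """
    size = stats["totalSize"] if use_total_size else stats["rawDataSize"]
    return size < threshold


# hive.auto.convert.join.noconditionaltask.size=100000000 (100M)
mr_100m = mapjoin_eligible(Q25_TABLE_STATS, 100_000_000, use_total_size=True)
spark_100m = mapjoin_eligible(Q25_TABLE_STATS, 100_000_000, use_total_size=False)

# hive.auto.convert.join.noconditionaltask.size=150000000 (150M)
spark_150m = mapjoin_eligible(Q25_TABLE_STATS, 150_000_000, use_total_size=False)

print(mr_100m, spark_100m, spark_150m)
```

With the numbers from the issue, the sketch reproduces the reported behaviour: the table passes the 100M threshold when totalSize is compared, fails it when rawDataSize is compared, and passes again once the threshold is raised to 150M.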
-- This message was sent by Atlassian JIRA (v6.3.4#6332)