Firstly, spark.sql.autoBroadcastJoinThreshold only takes effect for equi-joins.
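To illustrate the distinction, here is a minimal sketch against the Spark 1.5-era API; `sqlContext`, `df1`, `df2` and the column names are placeholders, not taken from your job:

    // Threshold (in bytes) under which the smaller side of an equi-join is
    // broadcast as a hash join; the default is 10 MB.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)

    // Equi-join: eligible for BroadcastHashJoin when one side's estimated
    // size falls under the threshold.
    val equiJoin = df1.join(df2, df1("id") === df2("id"))

    // Non-equi join: the threshold does not apply. An inner join falls back
    // to CartesianProduct; outer joins go through BroadcastNestedLoopJoin.
    val nonEquiJoin = df1.join(df2, df1("a") < df2("b") || df1("c") === df2("d"), "left_outer")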
Currently, for a non-equi join, an inner join is executed as a CartesianProduct, while BroadcastNestedLoopJoin handles the outer joins. In BroadcastNestedLoopJoin, the table with the smaller estimated size is broadcast, but if that smaller table is itself huge, I don't think Spark SQL can handle it right now (it will OOM). So, I am not sure how you created the df1 instance, but it would be better to make its statistics reflect its real size and let the framework decide what to do. Hopefully Spark SQL can support non-equi joins between large tables in the next release.

Hao

From: gen tang [mailto:gen.tan...@gmail.com]
Sent: Tuesday, August 11, 2015 10:12 PM
To: dev@spark.apache.org
Subject: Potential bug broadcastNestedLoopJoin or default value of spark.sql.autoBroadcastJoinThreshold

Hi,

Recently, I used Spark SQL to do a join on a non-equality condition ("condition1 OR condition2", for example). Spark uses BroadcastNestedLoopJoin to execute this. Assume that one of the DataFrames (df1) is created neither from a Hive table nor from a local collection, and the other (df2) is created from a Hive table. For df1, Spark will estimate its size as defaultSizeInBytes * length of df1, while it uses the correct size for df2. As a result, in most cases Spark will think df1 is bigger than df2, even when df2 is really huge. Spark will then do df2.collect(), which causes errors or slows the program down.

Maybe we should just use defaultSizeInBytes for LogicalRDD, not defaultSizeInBytes * length?

Hope this could be helpful.

Thanks a lot in advance for your help and input.

Cheers
Gen
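For anyone hitting this, the estimates in question can be inspected directly. A sketch against the Spark 1.5-era developer API (df1 and df2 as in the example above; the exact accessors are an assumption based on that release line):

    // Inspect the size estimates the planner compares when choosing which
    // side of the join to broadcast.
    val df1Size = df1.queryExecution.analyzed.statistics.sizeInBytes
    val df2Size = df2.queryExecution.analyzed.statistics.sizeInBytes
    println(s"df1 estimate: $df1Size bytes, df2 estimate: $df2Size bytes")

    // Note: spark.sql.defaultSizeInBytes deliberately defaults to
    // autoBroadcastJoinThreshold + 1, so that a plan of unknown size is
    // never chosen for broadcast on its own.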