My code is below:

from pyspark import SparkConf
from pyspark.sql import SparkSession

cfg = SparkConf().setAppName('MyApp')
spark = SparkSession.builder.config(conf=cfg).getOrCreate()


rdd1 = spark.createDataFrame([(1, 'a'), (2, 'b'), (4, 'c')], ['idx', 'val'])
rdd1.registerTempTable('rdd1')
rdd2 = spark.createDataFrame([(1, 2, 100), (1, 3, 200), (2, 3, 300)],
                             ['key1', 'key2', 'val'])
rdd2.registerTempTable('rdd2')


what_i_want = spark.sql("""
select
*
from rdd2 a
left outer join rdd1 b
on a.key1 = b.idx
left outer join rdd1 c
on a.key2 = c.idx
""")
what_i_want.show()


try_to_use_API = rdd2.join(rdd1, on=[rdd2['key1'] == rdd1['idx']], how='left_outer') \
    .join(rdd1, on=[rdd2['key2'] == rdd1['idx']], how='left_outer')
try_to_use_API.show()


But try_to_use_API does not work and raises this error:
pyspark.sql.utils.AnalysisException: u'Both sides of this join are outside the 
broadcasting threshold and computing it could be prohibitively expensive. To 
explicitly enable it, please set spark.sql.crossJoin.enabled = true;'
How can I fix this?
Thanks
