Hi, I hope this answers your question. You can hint the broadcast in SQL as detailed here: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-joins-broadcast.html (thanks Jacek :) ). I'd recommend creating a temporary table (or view) that already applies the trimming you use in the join, for clarity. Also keep in mind that the DataFrame API methods are often more powerful and readable than raw Spark SQL (as with the broadcast case here, although it also depends on personal preference).
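For instance, here is a minimal sketch of both approaches (the SQL hint and the broadcast() function), assuming a SparkSession named spark, as in spark-shell, and the table names A and B from your query:

```
import org.apache.spark.sql.functions.{broadcast, trim}

// SQL broadcast hint (available since Spark 2.2): ask the planner to broadcast A.
val bySql = spark.sql("""
  SELECT /*+ BROADCAST(A) */
         A.*,
         B.nsf_cards_ratio * 1.00 / A.nsf_on_entry AS nsf_ratio_to_pop
  FROM B LEFT JOIN A
    ON trim(A.country) = trim(B.cntry_code)
""")

// The same join through the DataFrame API: broadcast() marks the small table
// explicitly, independent of spark.sql.autoBroadcastJoinThreshold.
val a = spark.table("A")
val b = spark.table("B")
val byApi = b.join(broadcast(a), trim(a("country")) === trim(b("cntry_code")), "left")

// Either way, explain() should now show BroadcastHashJoin instead of SortMergeJoin.
byApi.explain()
```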
Regards,

Ruben

--
Rubén Berenguel

On 29 July 2019 at 07:12:30, zhangliyun (kelly...@126.com) wrote:

Hi all:

I want to ask a question about broadcast join in Spark SQL.

```
select A.*, B.nsf_cards_ratio * 1.00 / A.nsf_on_entry as nsf_ratio_to_pop
from B left join A on trim(A.country) = trim(B.cntry_code);
```

Here A is a small table with only 8 rows, but somehow the statistics of table A have a problem. A join B is executed as a sort merge join, even though the join key (trim(A.country) = trim(B.cntry_code)) only has several values (nearly 21 countries). Is there any way I can force Spark SQL to use a broadcast join? I cannot simply enlarge spark.sql.autoBroadcastJoinThreshold, because I do not know the table size Spark SQL works with. I tried to print the physical plan, but it does not show the table size, so I do not know how large spark.sql.autoBroadcastJoinThreshold must be to turn the sort merge join into a broadcast join.

```
== Parsed Logical Plan ==
'Project [ArrayBuffer(cc_base_part1).*, (('cc_base_part1.nsf_cards_ratio * 1.00) / 'cc_rank_agg.nsf_on_entry) AS nsf_ratio_to_pop#369]
+- 'Join LeftOuter, ('trim('cc_base_part1.country) = 'trim('cc_rank_agg.cntry_code))
   :- 'UnresolvedRelation `cc_base_part1`
   +- 'UnresolvedRelation `cc_rank_agg`

== Analyzed Logical Plan ==
cust_id: string, country: string, cc_id: decimal(38,0), bin_hmac: string, credit_card_created_date: string, card_usage: smallint, cc_category: string, cc_size: string, nsf_risk: string, nsf_cards_ratio: decimal(18,2), dt: string, nsf_ratio_to_pop: decimal(38,6)
Project [cust_id#372, country#373, cc_id#374, bin_hmac#375, credit_card_created_date#376, card_usage#377, cc_category#378, cc_size#379, nsf_risk#380, nsf_cards_ratio#381, dt#382, CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(nsf_cards_ratio#381 as decimal(18,2))) * promote_precision(cast(1.00 as decimal(18,2)))), DecimalType(22,4)) as decimal(38,16))) / promote_precision(cast(nsf_on_entry#386 as decimal(38,16)))), DecimalType(38,6)) AS nsf_ratio_to_pop#369]
+- Join LeftOuter, (trim(country#373, None) = trim(cntry_code#383, None))
   :- SubqueryAlias cc_base_part1
   :  +- HiveTableRelation `fpv365h`.`cc_base_part1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [cust_id#372, country#373, cc_id#374, bin_hmac#375, credit_card_created_date#376, card_usage#377, cc_category#378, cc_size#379, nsf_risk#380, nsf_cards_ratio#381], [dt#382]
   +- SubqueryAlias cc_rank_agg
      +- HiveTableRelation `fpv365h`.`cc_rank_agg`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [cntry_code#383, num_tot_cards#384L, num_nsf_cards#385L, nsf_on_entry#386], [dt#387]
```

Does Spark have any command to show the table size when printing the physical plan?

I'd appreciate any help with this question.

Best regards,
Kelly Zhang
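A minimal sketch of one way to check the size estimate that the planner compares against spark.sql.autoBroadcastJoinThreshold, assuming a SparkSession named spark (as in spark-shell), Spark 2.3 or later, and the table names taken from the plan above:

```
// Inspect the optimizer's size estimate for the small table; this is the number
// the planner compares against spark.sql.autoBroadcastJoinThreshold.
val plan = spark.table("fpv365h.cc_base_part1").queryExecution.optimizedPlan
println(plan.stats.sizeInBytes)

// For Hive tables, computing statistics often fixes a wrong estimate, so the
// broadcast can then be chosen automatically.
spark.sql("ANALYZE TABLE fpv365h.cc_base_part1 COMPUTE STATISTICS")

// Depending on the Spark version, EXPLAIN COST also prints the logical plan
// annotated with size estimates.
spark.sql("EXPLAIN COST SELECT * FROM fpv365h.cc_base_part1").show(false)
```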