Hi all,
I want to ask a question about broadcast joins in Spark SQL.



```
select B.*, B.nsf_cards_ratio * 1.00 / A.nsf_on_entry as nsf_ratio_to_pop
from B
left join A
on trim(B.country) = trim(A.cntry_code);
```
Here A is a small table with only 8 rows, but somehow the statistics for table A seem to be wrong.


The A join B is executed as a sort-merge join, even though the join key (trim(B.country) = trim(A.cntry_code)) only has a few distinct values (about 21 countries). Is there any way to force Spark SQL to use a broadcast join? I cannot simply enlarge spark.sql.autoBroadcastJoinThreshold, because I do not know the exact size that Spark SQL computes for the table.
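What I am hoping for is something like a broadcast hint on the small table (just a sketch; I believe the BROADCAST/MAPJOIN hints only exist in Spark 2.2+, and I am not sure they apply to my setup):

```
select /*+ BROADCAST(A) */
       B.*, B.nsf_cards_ratio * 1.00 / A.nsf_on_entry as nsf_ratio_to_pop
from B
left join A
on trim(B.country) = trim(A.cntry_code);
```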


I tried to print the physical plan, but it does not show the table size, so I do not know how large spark.sql.autoBroadcastJoinThreshold would have to be to turn the sort-merge join into a broadcast join.




```
== Parsed Logical Plan ==
'Project [ArrayBuffer(cc_base_part1).*, (('cc_base_part1.nsf_cards_ratio * 1.00) / 'cc_rank_agg.nsf_on_entry) AS nsf_ratio_to_pop#369]
+- 'Join LeftOuter, ('trim('cc_base_part1.country) = 'trim('cc_rank_agg.cntry_code))
   :- 'UnresolvedRelation `cc_base_part1`
   +- 'UnresolvedRelation `cc_rank_agg`

== Analyzed Logical Plan ==
cust_id: string, country: string, cc_id: decimal(38,0), bin_hmac: string, credit_card_created_date: string, card_usage: smallint, cc_category: string, cc_size: string, nsf_risk: string, nsf_cards_ratio: decimal(18,2), dt: string, nsf_ratio_to_pop: decimal(38,6)
Project [cust_id#372, country#373, cc_id#374, bin_hmac#375, credit_card_created_date#376, card_usage#377, cc_category#378, cc_size#379, nsf_risk#380, nsf_cards_ratio#381, dt#382, CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(nsf_cards_ratio#381 as decimal(18,2))) * promote_precision(cast(1.00 as decimal(18,2)))), DecimalType(22,4)) as decimal(38,16))) / promote_precision(cast(nsf_on_entry#386 as decimal(38,16)))), DecimalType(38,6)) AS nsf_ratio_to_pop#369]
+- Join LeftOuter, (trim(country#373, None) = trim(cntry_code#383, None))
   :- SubqueryAlias cc_base_part1
   :  +- HiveTableRelation `fpv365h`.`cc_base_part1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [cust_id#372, country#373, cc_id#374, bin_hmac#375, credit_card_created_date#376, card_usage#377, cc_category#378, cc_size#379, nsf_risk#380, nsf_cards_ratio#381], [dt#382]
   +- SubqueryAlias cc_rank_agg
      +- HiveTableRelation `fpv365h`.`cc_rank_agg`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [cntry_code#383, num_tot_cards#384L, num_nsf_cards#385L, nsf_on_entry#386], [dt#387]
```
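For reference, what I tried (and what I was considering) looks roughly like this; the 100 MB value below is a blind guess, which is exactly my problem:

```
-- print the parsed/analyzed/optimized/physical plans
explain extended
select B.*, B.nsf_cards_ratio * 1.00 / A.nsf_on_entry as nsf_ratio_to_pop
from B
left join A
on trim(B.country) = trim(A.cntry_code);

-- enlarging the threshold blindly, e.g. to 100 MB
set spark.sql.autoBroadcastJoinThreshold=104857600;
```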


Does Spark have any command to show the table size when printing the physical plan? I would appreciate any help with this question.
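In case computing statistics first is the right approach, would something like this make the size show up (a guess on my side, using the table names from the plan above)?

```
-- compute table-level statistics, then inspect sizeInBytes / rowCount
analyze table fpv365h.cc_rank_agg compute statistics;
describe extended fpv365h.cc_rank_agg;
```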




Best regards


Kelly Zhang
