Spark SQL: Understanding BroadcastNestedLoopJoin and number of partitions

David Hodeffi Wed, 21 Dec 2016 04:44:41 -0800

I am joining two DataFrames, one small and one big. The optimizer chooses 
BroadcastNestedLoopJoin.
The big DataFrame has 200 partitions, while the small DataFrame has 5 
partitions.
The joined DataFrame ends up with 205 partitions (joined.rdd.partitions.size). 
I tried to understand where this number comes from and found that 
BroadcastNestedLoopJoin actually ends with a union.
code:
case class BroadcastNestedLoopJoin(...) {
  def doExecute(): RDD[InternalRow] = {
    ...
    sparkContext.union(
      matchedStreamRows,
      sparkContext.makeRDD(notMatchedBroadcastRows))
  }
}
Can someone please explain what exactly the code of doExecute() does? Can you 
elaborate on all the null checks and why we can have nulls? Why do we end up 
with 205 partitions? A link to a JIRA with a discussion that explains the code 
would help.
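
To make the partition arithmetic concrete, here is a minimal sketch that uses 
only plain RDDs, not the join operator itself. It assumes spark-core is on the 
classpath and a local[5] master, chosen only to mirror the numbers above. The 
object name, the method name counts, and the sample data are hypothetical. 
makeRDD is called without numSlices, so its partition count falls back to 
sparkContext.defaultParallelism, which should be 5 under local[5] unless 
spark.default.parallelism is set. Whether that is also where the 5 in the 
join comes from is not something this sketch can confirm.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object UnionPartitionsSketch {
  // Returns (streamed, leftover, unioned) partition counts.
  // Hypothetical helper for illustration; not part of Spark.
  def counts(): (Int, Int, Int) = {
    val conf = new SparkConf()
      .setAppName("union-partitions-sketch")
      .setMaster("local[5]") // assumption: local mode with 5 cores
    val sc = new SparkContext(conf)
    try {
      // Stand-in for the streamed (big) side: explicitly 200 partitions.
      val streamed = sc.parallelize(1 to 1000, numSlices = 200)
      // Stand-in for the unmatched broadcast rows: no numSlices given,
      // so makeRDD uses sc.defaultParallelism.
      val leftover = sc.makeRDD(Seq(-1, -2, -3))
      // union concatenates the partitions of its inputs when they
      // do not share a partitioner.
      val unioned = sc.union(streamed, leftover)
      (streamed.partitions.length,
       leftover.partitions.length,
       unioned.partitions.length)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (s, l, u) = counts()
    println(s"streamed=$s leftover=$l unioned=$u")
    // The unioned RDD carries the partitions of both inputs.
    assert(u == s + l)
  }
}
```

If the same pattern holds inside the join, the partition count of the result 
would be the streamed side's count plus whatever makeRDD produces for the 
unmatched broadcast rows.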



