What I discovered is that if I write the table being joined to disk and then read it back, Spark correctly broadcasts it. I believe the cause is that Spark's size estimate for the smaller table is far larger than its actual size, so the planner concludes the table exceeds the broadcast threshold and falls back to a SortMergeJoin. Writing the table to disk and reading it back gives Spark an accurate size, derived from the files themselves, and it then goes ahead with a broadcast join.
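For anyone hitting the same thing, here is a rough sketch of how to check the estimate and apply the workaround (the paths, the DataFrame names, and the join key "id" are made up, and the stats API shown is the Spark 2.3+ form):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.broadcast

  val spark = SparkSession.builder().appName("broadcast-join-check").getOrCreate()

  val large = spark.read.parquet("/data/large_table")
  val small = spark.read.parquet("/data/small_table")

  // Inspect Spark's size estimate for the smaller side. If this exceeds
  // spark.sql.autoBroadcastJoinThreshold (10 MB by default), the planner
  // will choose a SortMergeJoin instead of broadcasting.
  println(small.queryExecution.optimizedPlan.stats.sizeInBytes)

  // The workaround described above: materialize the small table and re-read
  // it, so the size estimate comes from the actual file sizes on disk.
  small.write.mode("overwrite").parquet("/tmp/small_table_materialized")
  val smallMaterialized = spark.read.parquet("/tmp/small_table_materialized")

  val joined = large.join(smallMaterialized, Seq("id"))

An alternative that avoids the disk round trip entirely is to hint the broadcast explicitly, e.g. large.join(broadcast(small), Seq("id")), which tells the planner to broadcast regardless of its size estimate.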
