Hello All,

I'm trying to join two DataFrames, A and B, with:

sqlContext.sql("SELECT * FROM A INNER JOIN B ON A.a=B.a");

What I have done is load A and B from different sources and call
registerTempTable on each. I need the join to be really fast, and I was
wondering if there is a way to keep using the SQL statement and still get a
map-side join (say my table B is small)?
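
For context, the setup is roughly the following; the read calls and paths
are placeholders for my real sources, not the exact code:

DataFrame A = sqlContext.read().parquet("/path/to/A");   // big table (placeholder path)
DataFrame B = sqlContext.read().parquet("/path/to/B");   // small table (placeholder path)

A.registerTempTable("A");
B.registerTempTable("B");

DataFrame joined = sqlContext.sql("SELECT * FROM A INNER JOIN B ON A.a = B.a");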

I read some articles on using broadcast variables to do map-side joins.
Could I do something like the following and then execute my SQL statement to
achieve a map-side join?

DataFrame B = sparkContext.broadcast(B);
B.registerTempTable("B");
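
The only map-side alternatives I have come across so far are the two below;
I'm not sure either is the right approach. The first assumes the broadcast
hint in org.apache.spark.sql.functions (available since 1.5) and skips the
SQL string, the second keeps the SQL string and just raises the
auto-broadcast threshold:

import static org.apache.spark.sql.functions.broadcast;

// (1) Explicit broadcast hint via the DataFrame API instead of the SQL string:
DataFrame joined = A.join(broadcast(B), A.col("a").equalTo(B.col("a")));

// (2) Keep the SQL statement, but raise spark.sql.autoBroadcastJoinThreshold
// (default is about 10 MB) so the planner can pick a broadcast join for B on its own:
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", String.valueOf(100 * 1024 * 1024)); // 100 MB, for example
DataFrame joined2 = sqlContext.sql("SELECT * FROM A INNER JOIN B ON A.a = B.a");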


I have the join as stated above, and I see the following in my executor logs:

16/02/29 17:02:35 INFO TaskSetManager: Finished task 198.0 in stage 7.0 (TID 1114) in 20354 ms on localhost (196/200)
16/02/29 17:02:35 INFO ShuffleBlockFetcherIterator: Getting 200 non-empty blocks out of 200 blocks
16/02/29 17:02:35 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/02/29 17:02:35 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 128 blocks
16/02/29 17:02:35 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/02/29 17:03:03 INFO Executor: Finished task 199.0 in stage 7.0 (TID 1115). 2511 bytes result sent to driver
16/02/29 17:03:03 INFO TaskSetManager: Finished task 199.0 in stage 7.0 (TID 1115) in 27621 ms on localhost (197/200)
*16/02/29 17:07:06 INFO UnsafeExternalSorter: Thread 124 spilling sort data of 256.0 KB to disk (0 time so far)*


Now, I have around 10G of executor memory, my memory fraction should be the
default (0.75 as per the documentation), and my memory usage is < 1.5G (from
the Storage tab on the Spark dashboard), but it still spills sort data. I'm a
little surprised that this happens even when I have enough memory free.
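
For reference, these are the settings I believe are in effect; mostly
defaults, and the app name is just a placeholder:

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    .setAppName("JoinTest")                       // placeholder
    .set("spark.executor.memory", "10g")          // memory given to each executor
    .set("spark.memory.fraction", "0.75")         // Spark 1.6 default
    .set("spark.sql.shuffle.partitions", "200");  // default, matches the 200 tasks above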

Any inputs will be greatly appreciated!

Thanks
-- 
Regards,
Deepak Gopalakrishnan
Mobile: +918891509774
Skype: deepakgk87
http://myexps.blogspot.com
