Re: Left outer joining big data set with small lookups

2015-08-18 Thread VIJAYAKUMAR JAWAHARLAL
Nope. The count action did not make Spark choose a broadcast join. All of my tables are Hive external tables, so I tried to trigger compute statistics from sqlContext.sql. It gives me an error saying “no such table”. I am not sure whether that is due to the following bug in 1.4.1: https://issues.apache.org/jira/br

Re: Left outer joining big data set with small lookups

2015-08-17 Thread Silvio Fiorito
Try doing a count on both lookups to force the caching to occur before the join.
On 8/17/15, 12:39 PM, "VIJAYAKUMAR JAWAHARLAL" wrote:
> Thanks for your help
>
> I tried to cache the lookup tables and left outer join them with the big table (DF).
> The join does not seem to be using a broadcast join; still i

Re: Left outer joining big data set with small lookups

2015-08-17 Thread VIJAYAKUMAR JAWAHARLAL
Thanks for your help. I tried to cache the lookup tables and left outer join them with the big table (DF). The join does not seem to use a broadcast join; it still goes with a hash-partitioned join and shuffles the big table. Here is the scenario … table1 as big_df left outer join table2 as lkup on big_df.lkup

Re: Left outer joining big data set with small lookups

2015-08-14 Thread Raghavendra Pandey
In Spark 1.4 there is a parameter to control that (spark.sql.autoBroadcastJoinThreshold). Its default value is 10 MB, so you need to cache your DataFrame to hint its size.
On Aug 14, 2015 7:09 PM, "VIJAYAKUMAR JAWAHARLAL" wrote:
> Hi
>
> I am facing a huge performance problem when I am trying to left outer join a
> very big data set (~140G
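The size check described in this reply can be pictured with a small plain-Python sketch. It is a simplified model, not Spark's planner code: the function `choose_join_strategy` and its string results are hypothetical, the 10 MB default mirrors the documented default of `spark.sql.autoBroadcastJoinThreshold`, and treating an unknown size as "too large" is an assumption of the model. It may help explain why a lookup table without a usable size estimate still gets shuffled.

```python
# Assumed default, mirroring spark.sql.autoBroadcastJoinThreshold (10 MB).
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024


def choose_join_strategy(estimated_size_bytes,
                         threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Return "broadcast" or "shuffle" for the small side of a join.

    Hypothetical helper: a simplified model of a size-threshold decision.
    """
    # A negative threshold models broadcasting being switched off.
    if threshold_bytes < 0:
        return "shuffle"
    # Assumption of this model: no size estimate means "too large to broadcast".
    if estimated_size_bytes is None:
        return "shuffle"
    # Broadcast only when the estimate fits under the threshold.
    if estimated_size_bytes <= threshold_bytes:
        return "broadcast"
    return "shuffle"


small_lookup = choose_join_strategy(5 * 1024 * 1024)    # 5 MB lookup
big_table = choose_join_strategy(140 * 1024 ** 3)       # ~140 GB table
no_statistics = choose_join_strategy(None)              # size unknown
```

Under this model, caching a lookup or computing table statistics matters only because it gives the planner a size estimate to compare against the threshold.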

Re: Left outer joining big data set with small lookups

2015-08-14 Thread Silvio Fiorito
You could cache the lookup DataFrames; it’ll then do a broadcast join.
On 8/14/15, 9:39 AM, "VIJAYAKUMAR JAWAHARLAL" wrote:
> Hi
>
> I am facing a huge performance problem when I am trying to left outer join a very
> big data set (~140GB) with a bunch of small lookups [star schema type]. I am
> usin
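For readers unfamiliar with the term, the broadcast join suggested in this reply can be sketched in plain Python, independent of Spark. The small lookup is turned into an in-memory hash map that every task would hold, and the big data set is streamed past it without being shuffled. This is a conceptual sketch only, not Spark's implementation; the function name `broadcast_left_outer_join` and the sample rows are hypothetical.

```python
def broadcast_left_outer_join(big_rows, lookup_rows, big_key, lookup_key):
    """Left outer join a large iterable of dict rows against a small lookup.

    Conceptual sketch: the lookup is assumed small enough to fit in memory.
    """
    # Build side: hash the small lookup once (the part that would be broadcast).
    lookup_index = {}
    for row in lookup_rows:
        lookup_index.setdefault(row[lookup_key], []).append(row)

    # Probe side: stream the big rows; the big side is never shuffled or sorted.
    for big in big_rows:
        key = big[big_key]
        # SQL-style semantics: a null (None) key never matches anything.
        matches = lookup_index.get(key) if key is not None else None
        if not matches:
            # Left outer semantics: keep the big row with an empty lookup side.
            yield big, None
        else:
            for match in matches:
                yield big, match


# Hypothetical sample data for illustration.
big = [{"id": 1, "lkup_id": "a"}, {"id": 2, "lkup_id": "z"}]
lookup = [{"lkup_id": "a", "name": "alpha"}]
joined = list(broadcast_left_outer_join(big, lookup, "lkup_id", "lkup_id"))
```

The cost of the build step depends only on the lookup's size, which is why this strategy suits a ~140GB table joined against small lookups, provided the planner knows the lookups are small.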