Nope. The count action did not help to choose a broadcast join.
All of my tables are Hive external tables, so I tried to trigger COMPUTE
STATISTICS from sqlContext.sql. It gives me an error saying "no such table". I
am not sure whether that is due to the following bug in 1.4.1:
https://issues.apache.org/jira/br
Try doing a count on both lookups to force the caching to occur before the join.
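As a sketch of what that looks like in DataFrame code (the table, column, and variable names here are made up for illustration, and this assumes a live `sqlContext` against the Hive metastore, so it is a fragment rather than something standalone-runnable):

```python
# Hypothetical sketch, PySpark 1.4-era API; names are assumptions.
lkup = sqlContext.table("lookup1").cache()  # mark for caching (lazy)
lkup.count()                                # action: materializes the cache and
                                            # lets Spark record the in-memory size
                                            # before the join is planned
result = big_df.join(lkup, big_df["lkup_key"] == lkup["key"], "left_outer")
```

The point is that `cache()` alone is lazy; without an action first, the optimizer may not know the lookup is small enough to broadcast.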
On 8/17/15, 12:39 PM, "VIJAYAKUMAR JAWAHARLAL" wrote:
Thanks for your help.
I tried to cache the lookup tables and left outer join with the big table (DF).
The join does not seem to be using a broadcast join; it still goes with a
hash-partitioned join, shuffling the big table. Here is the scenario:
…
table1 as big_df
left outer join
table2 as lkup
on big_df.lkup
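For context on why the broadcast path matters in this scenario: a broadcast (map-side) hash join ships the small lookup to every worker and probes it locally, so the big table never has to be repartitioned by join key. A minimal pure-Python sketch of the idea, with made-up rows and column names:

```python
# Sketch of a broadcast-style hash join: build a hash map from the small
# lookup table, then stream the big table through it. No row of the big
# table is ever shuffled by join key.
big_table = [
    {"id": 1, "lkup_key": "a"},
    {"id": 2, "lkup_key": "b"},
    {"id": 3, "lkup_key": "x"},   # no match in the lookup
]
lookup = [{"key": "a", "desc": "alpha"}, {"key": "b", "desc": "beta"}]

# "Broadcast": materialize the small side as an in-memory hash map.
lookup_map = {row["key"]: row["desc"] for row in lookup}

# Left outer join: unmatched big-table rows keep None for the lookup column.
joined = [{**row, "desc": lookup_map.get(row["lkup_key"])} for row in big_table]
```

A shuffle join, by contrast, would repartition both sides by key, which for a ~140 GB table means moving most of it across the network.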
In Spark 1.4 there is a parameter to control that
(spark.sql.autoBroadcastJoinThreshold). Its default value is 10 MB, so you
need to cache your DataFrame to hint the size.
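Tables smaller than that threshold are broadcast automatically, but only when Spark knows their size. A configuration sketch for raising it (the live `sqlContext` handle and the 50 MB value are assumptions for illustration):

```python
# Raise the broadcast threshold so slightly larger lookups still qualify
# for a broadcast join. 50 MB here is an arbitrary example value.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))
```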
On Aug 14, 2015 7:09 PM, "VIJAYAKUMAR JAWAHARLAL"
wrote:
You could cache the lookup DataFrames; Spark will then do a broadcast join.
On 8/14/15, 9:39 AM, "VIJAYAKUMAR JAWAHARLAL" wrote:
Hi,
I am facing a huge performance problem when I try to left outer join a very
big data set (~140 GB) with a bunch of small lookups [star schema type]. I am
using DataFrames in Spark SQL. It looks like data is shuffled and skewed when
that join happens. Is there any way to improve performance?