Can you use the broadcast hint? e.g.
df1.join(broadcast(df2))

The broadcast function is in org.apache.spark.sql.functions. Rough sketches of
the hint and of the compute-statistics route are below the quoted message.

On Wed, Nov 4, 2015 at 10:19 AM, Charmee Patel <[email protected]> wrote:

> Hi,
>
> If I have a Hive table, "analyze table ... compute statistics" will ensure
> Spark SQL has statistics for that table. When I have a dataframe, is there
> a way to force Spark to collect statistics?
>
> I have a large lookup file, and I am trying to get a broadcast join by
> applying a filter beforehand. This filtered RDD does not have statistics,
> so Catalyst does not choose a broadcast join. Unfortunately I have to use
> Spark SQL and cannot use the dataframe API, so I cannot give a broadcast
> hint in the join.
>
> For example, if the filtered RDD is saved as a table and compute statistics
> is run, I get
>
>   test.queryExecution.analyzed.statistics
>   org.apache.spark.sql.catalyst.plans.logical.Statistics = Statistics(38851747)
>
> whereas the filtered RDD as-is gives
>
>   org.apache.spark.sql.catalyst.plans.logical.Statistics = Statistics(58403444019505585)
>
> Forcing the filtered RDD to be materialized (cache/count) causes a different
> issue: the executors go into a deadlock-like state where not a single thread
> runs, for hours. I suspect caching a dataframe and doing a broadcast join on
> the same dataframe causes this. As soon as the cache is removed, the job
> moves forward.
>
> Is there a way for me to force statistics collection without caching the
> dataframe, so Spark SQL would use it in a broadcast join?
>
> Thanks,
> Charmee
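A minimal sketch of the hint, assuming the spark-shell sqlContext (a
HiveContext) and hypothetical table and column names ("facts", "small_lookup",
"id"); only the broadcast call itself comes from this thread, the rest is
illustration:

  import org.apache.spark.sql.functions.broadcast

  // Hypothetical inputs: a large fact table and a lookup that becomes small
  // once filtered.
  val facts  = sqlContext.table("facts")
  val lookup = sqlContext.table("small_lookup").filter("active = 'Y'")

  // broadcast() marks the filtered side so Catalyst plans a broadcast join
  // even though it has no reliable size estimate for that side.
  val joined = facts.join(broadcast(lookup), "id")

  // Inspect the size estimate Catalyst is using, as in the quoted message.
  println(joined.queryExecution.analyzed.statistics)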

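And, since the dataframe API is not an option on your side, a sketch of the
save-as-table plus compute-statistics route described in the quoted message,
staying in SQL; the table and column names ("small_lookup", "lookup_filtered",
"facts", "id") are again hypothetical:

  // Materialize the filtered lookup as a Hive metastore table via CTAS;
  // statistics from ANALYZE TABLE are only picked up for metastore tables.
  sqlContext.sql(
    "CREATE TABLE lookup_filtered AS SELECT * FROM small_lookup WHERE active = 'Y'")

  // noscan estimates the table size from its files without reading the data.
  sqlContext.sql("ANALYZE TABLE lookup_filtered COMPUTE STATISTICS noscan")

  // If the estimate lands under spark.sql.autoBroadcastJoinThreshold (10 MB
  // by default), Catalyst should broadcast lookup_filtered on its own.
  val joinedSql = sqlContext.sql(
    "SELECT f.* FROM facts f JOIN lookup_filtered l ON f.id = l.id")
  println(joinedSql.queryExecution.analyzed.statistics)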