For other reasons we are using Spark SQL, not the DataFrame API. I saw that the broadcast hint is only available in the DataFrame API.

On Wed, Nov 4, 2015 at 6:49 PM Reynold Xin <r...@databricks.com> wrote:
> Can you use the broadcast hint?
>
> e.g.
>
> df1.join(broadcast(df2))
>
> the broadcast function is in org.apache.spark.sql.functions
>
>
> On Wed, Nov 4, 2015 at 10:19 AM, Charmee Patel <charm...@gmail.com> wrote:
>
>> Hi,
>>
>> If I have a Hive table, analyze table ... compute statistics will ensure
>> Spark SQL has statistics for that table. When I have a DataFrame, is there
>> a way to force Spark to collect statistics?
>>
>> I have a large lookup file and I am trying to force a broadcast join by
>> applying a filter beforehand. This filtered RDD does not have statistics,
>> so Catalyst does not choose a broadcast join. Unfortunately I have to use
>> Spark SQL and cannot use the DataFrame API, so I cannot give a broadcast
>> hint in the join.
>>
>> An example:
>> If the filtered RDD is saved as a table and compute statistics is run,
>> the statistics are
>>
>> test.queryExecution.analyzed.statistics
>> org.apache.spark.sql.catalyst.plans.logical.Statistics =
>> Statistics(38851747)
>>
>> The filtered RDD as-is gives
>> org.apache.spark.sql.catalyst.plans.logical.Statistics =
>> Statistics(58403444019505585)
>>
>> Forcing the filtered RDD to be materialized (cache/count) causes a
>> different issue: executors go into a deadlock-like state where not a
>> single thread runs, for hours. I suspect caching a DataFrame plus a
>> broadcast join on the same DataFrame does this. As soon as the cache is
>> removed, the job moves forward.
>>
>> Is there a way for me to force statistics collection without caching a
>> DataFrame, so Spark SQL would use it in a broadcast join?
>>
>> Thanks,
>> Charmee
>>
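For reference, a minimal sketch of the two approaches discussed above, in Spark 1.x-era Scala. The DataFrame hint is exactly what Reynold suggests; the temp-table variant is a possible workaround when the join itself must be written in SQL, on the assumption that the broadcast hint attached to the DataFrame's logical plan survives registration as a temp table. All names here (`df1`, `df2`, `filteredDF`, `big_table`, `lookup_small`, and the column names) are hypothetical placeholders, not from the original thread.

```scala
import org.apache.spark.sql.functions.broadcast

// DataFrame API: wrap the small side in broadcast() to hint Catalyst.
val joined = df1.join(broadcast(df2), "key")

// Possible SQL workaround (untested sketch): register the broadcast-hinted
// DataFrame as a temp table, then join against it in Spark SQL. The hint
// lives in the logical plan, so it may carry through to the SQL query.
broadcast(filteredDF).registerTempTable("lookup_small")
val result = sqlContext.sql(
  "SELECT b.id, s.value FROM big_table b JOIN lookup_small s ON b.key = s.key")

// Alternative knob: raise the size threshold below which Spark SQL
// broadcasts a relation automatically (default is 10 MB).
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
  (100L * 1024 * 1024).toString)
```

Note the threshold approach only helps when Catalyst's size estimate for the filtered relation is accurate, which is exactly the problem described above: without computed statistics the estimate can be wildly inflated.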