Can you use the broadcast hint? e.g.
df1.join(broadcast(df2))

The broadcast function is in org.apache.spark.sql.functions. Rough sketches of
the hint and of the compute-statistics route are below the quoted message.

On Wed, Nov 4, 2015 at 10:19 AM, Charmee Patel <[email protected]> wrote:

> Hi,
>
> If I have a Hive table, "analyze table ... compute statistics" will ensure
> Spark SQL has statistics for that table. When I have a dataframe, is there
> a way to force Spark to collect statistics?
>
> I have a large lookup file, and I am trying to get a broadcast join by
> applying a filter beforehand. This filtered RDD does not have statistics,
> so Catalyst does not choose a broadcast join. Unfortunately I have to use
> Spark SQL and cannot use the dataframe API, so I cannot give a broadcast
> hint in the join.
>
> For example, if the filtered RDD is saved as a table and compute statistics
> is run, I get
>
>   test.queryExecution.analyzed.statistics
>   org.apache.spark.sql.catalyst.plans.logical.Statistics = Statistics(38851747)
>
> whereas the filtered RDD as-is gives
>
>   org.apache.spark.sql.catalyst.plans.logical.Statistics = Statistics(58403444019505585)
>
> Forcing the filtered RDD to be materialized (cache/count) causes a different
> issue: the executors go into a deadlock-like state where not a single thread
> runs, for hours. I suspect caching a dataframe and doing a broadcast join on
> the same dataframe causes this. As soon as the cache is removed, the job
> moves forward.
>
> Is there a way for me to force statistics collection without caching the
> dataframe, so Spark SQL would use it in a broadcast join?
>
> Thanks,
> Charmee
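A minimal sketch of the hint, assuming the spark-shell sqlContext (a
HiveContext) and hypothetical table and column names ("facts", "small_lookup",
"id"); only the broadcast call itself comes from this thread, the rest is
illustration:

  import org.apache.spark.sql.functions.broadcast

  // Hypothetical inputs: a large fact table and a lookup that becomes small
  // once filtered.
  val facts  = sqlContext.table("facts")
  val lookup = sqlContext.table("small_lookup").filter("active = 'Y'")

  // broadcast() marks the filtered side so Catalyst plans a broadcast join
  // even though it has no reliable size estimate for that side.
  val joined = facts.join(broadcast(lookup), "id")

  // Inspect the size estimate Catalyst is using, as in the quoted message.
  println(joined.queryExecution.analyzed.statistics)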

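And, since the dataframe API is not an option on your side, a sketch of the
save-as-table plus compute-statistics route described in the quoted message,
staying in SQL; the table and column names ("small_lookup", "lookup_filtered",
"facts", "id") are again hypothetical:

  // Materialize the filtered lookup as a Hive metastore table via CTAS;
  // statistics from ANALYZE TABLE are only picked up for metastore tables.
  sqlContext.sql(
    "CREATE TABLE lookup_filtered AS SELECT * FROM small_lookup WHERE active = 'Y'")

  // noscan estimates the table size from its files without reading the data.
  sqlContext.sql("ANALYZE TABLE lookup_filtered COMPUTE STATISTICS noscan")

  // If the estimate lands under spark.sql.autoBroadcastJoinThreshold (10 MB
  // by default), Catalyst should broadcast lookup_filtered on its own.
  val joinedSql = sqlContext.sql(
    "SELECT f.* FROM facts f JOIN lookup_filtered l ON f.id = l.id")
  println(joinedSql.queryExecution.analyzed.statistics)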