For other reasons we are using Spark SQL, not the DataFrame API. I saw that the broadcast hint is only available in the DataFrame API.

On Wed, Nov 4, 2015 at 6:49 PM Reynold Xin <r...@databricks.com> wrote:
> Can you use the broadcast hint?
>
> e.g.
>
> df1.join(broadcast(df2))
>
> the broadcast function is in org.apache.spark.sql.functions
>
>
> On Wed, Nov 4, 2015 at 10:19 AM, Charmee Patel <charm...@gmail.com> wrote:
>
>> Hi,
>>
>> If I have a Hive table, analyze table ... compute statistics will ensure
>> Spark SQL has statistics for that table. When I have a DataFrame, is there
>> a way to force Spark to collect statistics?
>>
>> I have a large lookup file and I am trying to force a broadcast join by
>> applying a filter beforehand. This filtered RDD does not have statistics,
>> so Catalyst does not choose a broadcast join. Unfortunately I have to use
>> Spark SQL and cannot use the DataFrame API, so I cannot give a broadcast
>> hint in the join.
>>
>> An example:
>> If the filtered RDD is saved as a table and compute statistics is run,
>> the statistics are
>>
>> test.queryExecution.analyzed.statistics
>> org.apache.spark.sql.catalyst.plans.logical.Statistics =
>> Statistics(38851747)
>>
>> The filtered RDD as-is gives
>> org.apache.spark.sql.catalyst.plans.logical.Statistics =
>> Statistics(58403444019505585)
>>
>> Forcing the filtered RDD to be materialized (cache/count) causes a
>> different issue: executors go into a deadlock-like state where not a
>> single thread runs, for hours. I suspect caching a DataFrame plus a
>> broadcast join on the same DataFrame does this. As soon as the cache is
>> removed, the job moves forward.
>>
>> Is there a way for me to force statistics collection without caching a
>> DataFrame, so Spark SQL would use it in a broadcast join?
>>
>> Thanks,
>> Charmee
>>
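For reference, a minimal sketch of the two approaches discussed above, in Spark 1.x-era Scala. The DataFrame hint is exactly what Reynold suggests; the temp-table variant is a possible workaround when the join itself must be written in SQL, on the assumption that the broadcast hint attached to the DataFrame's logical plan survives registration as a temp table. All names here (`df1`, `df2`, `filteredDF`, `big_table`, `lookup_small`, and the column names) are hypothetical placeholders, not from the original thread.

```scala
import org.apache.spark.sql.functions.broadcast

// DataFrame API: wrap the small side in broadcast() to hint Catalyst.
val joined = df1.join(broadcast(df2), "key")

// Possible SQL workaround (untested sketch): register the broadcast-hinted
// DataFrame as a temp table, then join against it in Spark SQL. The hint
// lives in the logical plan, so it may carry through to the SQL query.
broadcast(filteredDF).registerTempTable("lookup_small")
val result = sqlContext.sql(
  "SELECT b.id, s.value FROM big_table b JOIN lookup_small s ON b.key = s.key")

// Alternative knob: raise the size threshold below which Spark SQL
// broadcasts a relation automatically (default is 10 MB).
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
  (100L * 1024 * 1024).toString)
```

Note the threshold approach only helps when Catalyst's size estimate for the filtered relation is accurate, which is exactly the problem described above: without computed statistics the estimate can be wildly inflated.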