Hi,

If I have a Hive table, ANALYZE TABLE ... COMPUTE STATISTICS will ensure Spark
SQL has statistics for that table. When I have a DataFrame, is there a way
to force Spark to collect statistics?
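For the Hive-table case, what I have in mind is roughly this (the table name
is made up):

// after ANALYZE TABLE, Catalyst sees a real size estimate for the table
sqlContext.sql("ANALYZE TABLE lookup_table COMPUTE STATISTICS noscan")
sqlContext.table("lookup_table").queryExecution.analyzed.statistics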

I have a large lookup file, and I am trying to avoid a shuffle join by
filtering it down beforehand so it qualifies for a broadcast join. This
filtered RDD does not have statistics, so Catalyst does not choose a
broadcast join. Unfortunately I have to use Spark SQL and cannot use the
DataFrame API, so I cannot give a broadcast hint in the join.
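For reference, with the DataFrame API I would just hint the broadcast myself,
roughly like this (names are made up, and this is exactly what I cannot do
here since the join has to stay in SQL text):

import org.apache.spark.sql.functions.broadcast
// not an option for me, shown only to illustrate the hint I am missing
val joined = bigDF.join(broadcast(filteredDF),
  bigDF("lookup_key") === filteredDF("lookup_key"))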

Here is an example.
If the filtered RDD is saved as a table and COMPUTE STATISTICS is run, the statistics are:

test.queryExecution.analyzed.statistics
org.apache.spark.sql.catalyst.plans.logical.Statistics =
Statistics(38851747)


The filtered RDD as-is gives:
org.apache.spark.sql.catalyst.plans.logical.Statistics =
Statistics(58403444019505585)
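
Roughly how I produced the two numbers above (table and variable names are
made up):

// case 1: persist the filtered result as a Hive table and analyze it
filteredDF.write.saveAsTable("filtered_lookup")
sqlContext.sql("ANALYZE TABLE filtered_lookup COMPUTE STATISTICS noscan")
val test = sqlContext.table("filtered_lookup")
test.queryExecution.analyzed.statistics        // Statistics(38851747)

// case 2: the filtered DataFrame used directly, no statistics collected
filteredDF.queryExecution.analyzed.statistics  // Statistics(58403444019505585)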

Forcing the filtered RDD to be materialized (cache/count) causes a different
issue. The executors go into a deadlock-like state where not a single thread
runs, for hours. I suspect caching a DataFrame and then broadcast-joining the
same DataFrame does this. As soon as the cache is removed, the job moves forward.
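
The cache attempt is roughly this (and this is where the executors hang):

// force materialization so the size estimate reflects the cached data
filteredDF.cache()
filteredDF.count()
filteredDF.registerTempTable("filtered_lookup")
// the SQL join against filtered_lookup then never makes progress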

Is there a way for me to force statistics collection without caching the
DataFrame, so that Spark SQL would use it in a broadcast join?

Thanks,
Charmee
