Hi,

If I have a Hive table, running ANALYZE TABLE ... COMPUTE STATISTICS ensures that Spark SQL has statistics for that table. When I have a DataFrame, is there a way to force Spark to collect statistics?
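To be concrete, this is roughly what I mean by the Hive-table case (the table name is just a placeholder, and exact syntax may vary by Spark version; I am using a HiveContext):

    // Collect table-level statistics so Catalyst knows the table size
    sqlContext.sql("ANALYZE TABLE lookup_table COMPUTE STATISTICS noscan")

    // Catalyst then has a realistic sizeInBytes estimate when planning joins
    val test = sqlContext.sql("SELECT * FROM lookup_table")
    println(test.queryExecution.analyzed.statistics)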
I have a large lookup file, and I am trying to avoid a shuffle join by applying a filter beforehand so that the lookup side becomes small enough to broadcast. This filtered DataFrame does not have statistics, so Catalyst does not choose a broadcast join. Unfortunately I have to use Spark SQL and cannot use the DataFrame API, so I cannot give a broadcast hint in the join.

For example, if the filtered DataFrame is saved as a table and compute stats is run, the statistics are:

    test.queryExecution.analyzed.statistics
    org.apache.spark.sql.catalyst.plans.logical.Statistics = Statistics(38851747)

The filtered DataFrame as-is gives:

    org.apache.spark.sql.catalyst.plans.logical.Statistics = Statistics(58403444019505585)

Forcing the filtered DataFrame to be materialized (cache + count) causes a different issue: the executors go into a deadlock-like state where not a single thread runs, for hours. I suspect that caching a DataFrame and then doing a broadcast join on that same DataFrame causes this; as soon as the cache is removed, the job moves forward.

Is there a way for me to force statistics collection without caching the DataFrame, so that Spark SQL would use it in a broadcast join?

Thanks,
Charmee
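P.S. In case it helps, here is a stripped-down sketch of the setup. Table and column names are made up for illustration, and I am on the older registerTempTable-style API; the actual join has to go through Spark SQL, which is why I cannot use the broadcast() hint:

    // "big_lookup" is the large lookup table, "facts" is the table it joins to
    val filtered = sqlContext.sql("SELECT * FROM big_lookup WHERE region = 'US'")
    filtered.registerTempTable("filtered_lookup")

    // Size estimate Catalyst uses for the unmaterialized, filtered DataFrame --
    // far above spark.sql.autoBroadcastJoinThreshold, so no broadcast join:
    println(filtered.queryExecution.analyzed.statistics)

    // The join itself has to stay in SQL, so I cannot call broadcast() here:
    val joined = sqlContext.sql(
      "SELECT f.*, l.name FROM facts f JOIN filtered_lookup l ON f.key = l.key")

    // Materializing first (filtered.cache(); filtered.count()) is what leads to
    // the hung executors described above.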