I believe we started supporting broadcast outer joins in Spark 1.5. Which version are you using?
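Since Spark 1.5 there is also an explicit hint, org.apache.spark.sql.functions.broadcast(), which forces the planner to broadcast one side of the join regardless of autoBroadcastJoinThreshold. A minimal Java sketch (the parquet paths and the join column "id" are placeholders for illustration):

    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;
    import static org.apache.spark.sql.functions.broadcast;

    public class BroadcastJoinExample {
        // Force a broadcast hash join on the small side, independent of
        // spark.sql.autoBroadcastJoinThreshold. Requires Spark 1.5+.
        // Paths and the join key "id" are placeholder names.
        public static DataFrame leftOuterBroadcastJoin(SQLContext sqlContext) {
            DataFrame big = sqlContext.read().parquet("/path/to/big.parquet");
            DataFrame small = sqlContext.read().parquet("/path/to/small.parquet");
            // broadcast() marks the small DataFrame for broadcast to all executors
            return big.join(broadcast(small),
                            big.col("id").equalTo(small.col("id")),
                            "left_outer");
        }
    }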
On Fri, Dec 4, 2015 at 2:49 PM, Shuai Zheng <szheng.c...@gmail.com> wrote:

> Hi all,
>
> Sorry to re-open this thread.
>
> I have a similar issue: one big parquet file left-outer-joined against quite
> a few smaller parquet files. The job runs extremely slowly and sometimes
> OOMs (with 300M ...). I have two questions:
>
> 1. If I use an outer join, will Spark SQL automatically use a broadcast
> hash join?
>
> 2. If not, the latest documentation
> (http://spark.apache.org/docs/latest/sql-programming-guide.html) says:
>
>   spark.sql.autoBroadcastJoinThreshold
>   10485760 (10 MB)
>   Configures the maximum size in bytes for a table that will be broadcast to
>   all worker nodes when performing a join. By setting this value to -1
>   broadcasting can be disabled. Note that currently statistics are only
>   supported for Hive Metastore tables where the command ANALYZE TABLE
>   <tableName> COMPUTE STATISTICS noscan has been run.
>
> How can I run this ANALYZE TABLE command from Java? I know I can code it
> myself (create a broadcast variable and implement the lookup by hand), but
> that makes the code very ugly.
>
> I hope we can have either an API or a hint to force the broadcast hash join
> (instead of relying on the indirect autoBroadcastJoinThreshold parameter).
> Is there a ticket or roadmap item for this feature?
>
> Regards,
>
> Shuai
>
> *From:* Michael Armbrust [mailto:mich...@databricks.com]
> *Sent:* Wednesday, April 01, 2015 2:01 PM
> *To:* Jitesh chandra Mishra
> *Cc:* user
> *Subject:* Re: Broadcasting a parquet file using spark and python
>
> You will need to create a Hive parquet table that points to the data and
> run "ANALYZE TABLE tableName COMPUTE STATISTICS noscan" so that we have
> statistics on the size.
>
> On Tue, Mar 31, 2015 at 9:36 PM, Jitesh chandra Mishra
> <jitesh...@gmail.com> wrote:
>
> Hi Michael,
>
> Thanks for your response. I am running 1.2.1.
>
> Is there any workaround to achieve the same with 1.2.1?
>
> Thanks,
> Jitesh
>
> On Wed, Apr 1, 2015 at 12:25 AM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
> In Spark 1.3 I would expect this to happen automatically when the parquet
> table is small (< 10 MB, configurable with
> spark.sql.autoBroadcastJoinThreshold). If you are running 1.3 and not
> seeing this, can you show the code you are using to create the table?
>
> On Tue, Mar 31, 2015 at 3:25 AM, jitesh129 <jitesh...@gmail.com> wrote:
>
> How can we implement a BroadcastHashJoin in Spark with Python?
>
> My Spark SQL inner joins are taking a lot of time since they perform a
> ShuffledHashJoin.
>
> The tables being joined are stored as parquet files.
>
> Please help.
>
> Thanks and regards,
> Jitesh
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Broadcasting-a-parquet-file-using-spark-and-python-tp22315.html
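On the Java question above: the ANALYZE TABLE statement can be issued through HiveContext.sql(), since statistics are only collected for Hive Metastore tables. A minimal sketch ("my_parquet_table" is a placeholder table name; the threshold value is only an example):

    import org.apache.spark.SparkContext;
    import org.apache.spark.sql.hive.HiveContext;

    public class AnalyzeTableExample {
        // Collect table-size statistics so the planner can consider an
        // automatic broadcast join. The table must be registered in the
        // Hive Metastore; "my_parquet_table" is a placeholder name.
        public static void computeStats(SparkContext sc) {
            HiveContext hiveContext = new HiveContext(sc);
            hiveContext.sql(
                "ANALYZE TABLE my_parquet_table COMPUTE STATISTICS noscan");
            // Optionally raise the broadcast threshold (in bytes);
            // setting it to -1 disables broadcasting entirely.
            hiveContext.setConf(
                "spark.sql.autoBroadcastJoinThreshold", "104857600");
        }
    }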