I believe we started supporting broadcast outer joins in Spark 1.5.  Which
version are you using?
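
On 1.5+ the planner should pick the broadcast automatically once the small
side's size statistics come in under spark.sql.autoBroadcastJoinThreshold.
A quick way to check (a PySpark sketch; big_df and small_df are placeholder
DataFrames for your two parquet tables):

    # "key" is a hypothetical join column; substitute your own.
    joined = big_df.join(small_df, big_df.key == small_df.key, "left_outer")
    joined.explain()  # a broadcast plan should show BroadcastHashOuterJoin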

On Fri, Dec 4, 2015 at 2:49 PM, Shuai Zheng <szheng.c...@gmail.com> wrote:

> Hi all,
>
>
>
> Sorry to re-open this thread.
>
>
>
> I have a similar issue: one big parquet file left outer joined with quite a
> few smaller parquet files. But the job runs extremely slowly and sometimes
> even OOMs (with 300M). I have two questions here:
>
>
>
> 1. If I use an outer join, will Spark SQL automatically use a broadcast
> hash join?
>
> 2. If not: the latest documentation,
> http://spark.apache.org/docs/latest/sql-programming-guide.html, says:
>
>
>
> spark.sql.autoBroadcastJoinThreshold (default: 10485760, i.e. 10 MB)
>
> Configures the maximum size in bytes for a table that will be broadcast to
> all worker nodes when performing a join. Setting this value to -1 disables
> broadcasting. Note that currently statistics are only supported for Hive
> Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE
> STATISTICS noscan has been run.
>
>
>
> How can I do this (run the ANALYZE TABLE command) from Java? I know I could
> code it myself (create a broadcast variable and implement the lookup by
> hand), but that would make the code really ugly.
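>
> To be concrete, the manual workaround I mean looks roughly like this (a
> PySpark sketch for brevity; the paths and the join column "key" are made
> up):
>
>     # Collect the small side to the driver and broadcast it to the executors.
>     small = sqlContext.parquetFile("hdfs:///data/small") \
>         .rdd.map(lambda r: (r.key, r)).collectAsMap()
>     small_bc = sc.broadcast(small)
>
>     # Map-side "left outer join": keep every big-side row and attach the
>     # matching small-side row, or None when there is no match.
>     big = sqlContext.parquetFile("hdfs:///data/big")
>     joined = big.rdd.map(lambda r: (r, small_bc.value.get(r.key)))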
>
>
>
> I hope we can get either an API or a hint to force the broadcast hash join
> (instead of relying on the opaque autoBroadcastJoinThreshold parameter). Is
> there a ticket or roadmap for this feature?
>
>
>
> Regards,
>
>
>
> Shuai
>
>
>
> *From:* Michael Armbrust [mailto:mich...@databricks.com]
> *Sent:* Wednesday, April 01, 2015 2:01 PM
> *To:* Jitesh chandra Mishra
> *Cc:* user
> *Subject:* Re: Broadcasting a parquet file using spark and python
>
>
>
> You will need to create a Hive parquet table that points to the data and
> run "ANALYZE TABLE tableName COMPUTE STATISTICS noscan" so that we have
> statistics on the size.
>
>
>
> On Tue, Mar 31, 2015 at 9:36 PM, Jitesh chandra Mishra <
> jitesh...@gmail.com> wrote:
>
> Hi Michael,
>
>
>
> Thanks for your response. I am running 1.2.1.
>
>
>
> Is there any workaround to achieve the same with 1.2.1?
>
>
>
> Thanks,
>
> Jitesh
>
>
>
> On Wed, Apr 1, 2015 at 12:25 AM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
> In Spark 1.3 I would expect this to happen automatically when the parquet
> table is small (< 10 MB, configurable with
> spark.sql.autoBroadcastJoinThreshold).
> If you are running 1.3 and not seeing this, can you show the code you are
> using to create the table?
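>
> If your table is slightly over the default you can also raise the
> threshold, e.g. (the value is in bytes):
>
>     # Allow tables up to ~100 MB to be considered for broadcast.
>     sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
>                        str(100 * 1024 * 1024))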
>
>
>
> On Tue, Mar 31, 2015 at 3:25 AM, jitesh129 <jitesh...@gmail.com> wrote:
>
> How can we get a BroadcastHashJoin in Spark with Python?
>
> My Spark SQL inner joins are taking a lot of time because they are executed
> as a ShuffledHashJoin.
>
> The tables being joined are stored as parquet files.
>
> Please help.
>
> Thanks and regards,
> Jitesh
>
>
>