The default value of spark.sql.autoBroadcastJoinThreshold in 1.5.2 is 10MB.
According to the console output, Spark is doing a broadcast, but a query
like the following does not perform well:
select
big_t.*,
small_t.name range_name
from big_t
join small_t on (1=1)
where small_t.min <= big_t.v an
This type of broadcast should be handled by Spark SQL/DataFrames automatically.
It is the primary cost-based, physical-plan query optimization that the Spark
SQL Catalyst optimizer supports.
In Spark 1.5 and earlier, you can trigger this optimization by properly setting
the spark.sql.autobroad
I collected the small DF into an array of Tuple3.
Then I registered a UDF whose function does a lookup in that array.
Then I just ran a select that uses the UDF.
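
A minimal sketch of the lookup function behind such a UDF, in plain Python so
it runs without a cluster. The range values and the names RANGES, RANGE_STARTS
and range_name are hypothetical. It assumes the ranges do not overlap, and it
treats the upper bound as inclusive, which is a guess because the query above
is cut off.

```python
from bisect import bisect_right

# Hypothetical contents of the small table, collected to the driver as
# (min, max, name) tuples and sorted by min. Ranges are assumed not to overlap.
RANGES = [
    (0, 9, "low"),
    (10, 99, "medium"),
    (100, 999, "high"),
]
RANGE_STARTS = [r[0] for r in RANGES]


def range_name(v):
    """Return the name of the range containing v, or None if no range matches."""
    if v is None:
        return None
    # Index of the last range whose min is <= v.
    i = bisect_right(RANGE_STARTS, v) - 1
    if i < 0:
        return None
    lo, hi, name = RANGES[i]
    # Upper bound is inclusive here (an assumption, see above).
    return name if lo <= v <= hi else None
```

In PySpark 1.x a function like this could probably be registered with
sqlContext.registerFunction("range_name", range_name) and then called as
range_name(big_t.v) in the select, which avoids the join on (1=1) entirely.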
On Dec 18, 2015 1:06 AM, "Akhil Das" wrote:
> You can broadcast your json data and then do a map side join. This article
> is a good start http
You can broadcast your json data and then do a map side join. This article
is a good start http://dmtolpeko.com/2015/02/20/map-side-join-in-spark/
Thanks
Best Regards
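
A rough sketch of the map-side join idea, written in plain Python so it can be
run locally. The function map_side_join and the sample data are hypothetical;
the small dataset is assumed to fit in memory as a dict keyed by the join key.

```python
def map_side_join(records, lookup):
    """Yield (key, value, match) for each record whose key is in lookup.

    Records with no matching key are dropped, as in an inner join.
    """
    for key, value in records:
        match = lookup.get(key)
        if match is not None:
            yield (key, value, match)


# Hypothetical small dataset (e.g. parsed from JSON), keyed by join key.
small = {1: "one", 2: "two"}
# Hypothetical records from one partition of the big dataset.
big_partition = [(1, "a"), (2, "b"), (3, "c")]

joined = list(map_side_join(big_partition, small))
```

In Spark the dict would be shipped once per executor with
b = sc.broadcast(small), and the function applied per partition with
big_rdd.mapPartitions(lambda part: map_side_join(part, b.value)), so the big
dataset is never shuffled.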
On Wed, Dec 16, 2015 at 2:51 AM, Alexander Pivovarov
wrote:
> I have a big folder of ORC files. The files have a duration field (e.g