Re: How to do map join in Spark SQL

2015-12-20 Thread Alexander Pivovarov
The default value of spark.sql.autoBroadcastJoinThreshold in 1.5.2 is 10MB. According to the console output, Spark is doing a broadcast, but a query that looks like the following does not perform well: select big_t.*, small_t.name range_name from big_t join small_t on (1=1) where small_t.min <= big_t.v an

Re: How to do map join in Spark SQL

2015-12-20 Thread Chris Fregly
This type of broadcast should be handled by Spark SQL/DataFrames automatically. This is the primary cost-based, physical-plan query optimization that the Spark SQL Catalyst optimizer supports. In Spark 1.5 and before, you can trigger this optimization by properly setting the spark.sql.autobroad

Re: How to do map join in Spark SQL

2015-12-19 Thread Alexander Pivovarov
I collected the small DF into an array of Tuple3. Then I registered a UDF with a function that does the lookup in the array. Then I just ran a select that uses the UDF. On Dec 18, 2015 1:06 AM, "Akhil Das" wrote: > You can broadcast your json data and then do a map side join. This article > is a good start http
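
The collect-then-lookup approach described in this message can be sketched in plain Python, as a rough illustration only. This is not the poster's code: the tuple layout (min, max, name), the `max` column, the sample rows, and the helper name `range_name` are all assumptions made for the example, and the ranges are assumed not to overlap. In Spark, a function like this would be wrapped in a UDF and applied to each row of the big table.

```python
from bisect import bisect_right

# Hypothetical rows of the small table after collecting it to the driver.
# Layout (min, max, name) is an assumption; only `min` and `name` appear
# in the thread's query.
small_rows = [(100, 999, "high"), (0, 9, "low"), (10, 99, "mid")]

# Sort once by lower bound so each lookup is a binary search
# instead of a scan over every range.
ranges = sorted(small_rows)
lower_bounds = [r[0] for r in ranges]


def range_name(v):
    """Return the name of the range containing v, or None if none matches.

    Assumes the ranges do not overlap.
    """
    i = bisect_right(lower_bounds, v) - 1  # last range whose min <= v
    if i < 0:
        return None
    lo, hi, name = ranges[i]
    return name if lo <= v <= hi else None


# Stand-in for the values of big_t.v; in Spark the function above would
# run per row inside a UDF rather than in a local list comprehension.
big_values = [5, 10, 250, 1000, -3]
result = [(v, range_name(v)) for v in big_values]
print(result)
```

Sorting the collected array once keeps each lookup logarithmic in the number of ranges, which may matter when the function is called once per row of the big table.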

Re: How to do map join in Spark SQL

2015-12-18 Thread Akhil Das
You can broadcast your json data and then do a map side join. This article is a good start: http://dmtolpeko.com/2015/02/20/map-side-join-in-spark/ Thanks, Best Regards. On Wed, Dec 16, 2015 at 2:51 AM, Alexander Pivovarov wrote: > I have big folder having ORC files. Files have duration field (e.g
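
The idea behind a map-side join, as suggested in this reply, can be sketched in plain Python without Spark. This is only an illustration under stated assumptions: the table contents, the key/value layout, and the helper name `map_side_join` are hypothetical. In Spark, the lookup dictionary would typically be shipped to the executors as a broadcast variable, and the loop below would correspond to a map over the partitions of the big dataset.

```python
# Hypothetical small table, keyed by join key. In Spark this dict would be
# the value held by a broadcast variable.
small_table = [(1, "one"), (2, "two")]
lookup = dict(small_table)


def map_side_join(big_rows, lookup):
    """Inner-join big_rows against lookup without shuffling big_rows.

    Each big row is matched locally against the in-memory dict, which is
    what makes the join "map-side".
    """
    for key, payload in big_rows:
        if key in lookup:
            yield (key, payload, lookup[key])


# Hypothetical rows of the big table as (key, payload) pairs.
big_table = [(1, "a"), (2, "b"), (3, "c"), (1, "d")]
joined = list(map_side_join(big_table, lookup))
print(joined)
```

Rows of the big table whose key is missing from the small table are dropped here, which matches inner-join semantics; a left join would emit them with a placeholder instead.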