hm. Sounds like it ends up in Nested Loop Join (NLJ) as opposed to Hash Join (HJ) when OR is used for more than one predicate comparison.
In below I have a table dummy created as ORC with 1 billion rows. Just created another one called dummy1 with 60K rows A simple join results in Hash Join good! scala> sql("select d.id, d1.id from dummy d, dummy2 d1 where d.random_string = d1.random_string").explain(true) == Physical Plan == Project [id#212,id#219] *+- BroadcastHashJoin [random_string#216], [random_string#223], BuildRight* :- ConvertToUnsafe : +- HiveTableScan [id#212,random_string#216], MetastoreRelation test, dummy, Some(d) +- ConvertToUnsafe +- HiveTableScan [id#219,random_string#223], MetastoreRelation test, dummy2, Some(d1) When the join is done using OR on other predicates I see it starts doing NLJ scala> sql("select d.id, d1.id from dummy d, dummy2 d1 where d.random_string = d1.random_string OR d.small_vc = d1.small_vc").explain(true) == Physical Plan == Project [id#241,id#248] +- B*roadcastNestedLoopJoin *BuildRight, Inner, Some(((random_string#245 = random_string#252) || (small_vc#246 = small_vc#253))) :- HiveTableScan [small_vc#246,id#241,random_string#245], MetastoreRelation test, dummy, Some(d) +- HiveTableScan [id#248,random_string#252,small_vc#253], MetastoreRelation test, dummy2, Some(d1) in contrast the same identical tables in Oracle use Hash Join with OR which is expected scratch...@mydb.mich.LOCAL> select d.id, d1.id from dummy d, dummy2 d1 where d.random_string = d1.random_string OR d.small_vc = d1.small_vc; Execution Plan ---------------------------------------------------------- Plan hash value: 4163534687 -------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | -------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 63207 | 8332K| | 1280K (1)| 04:16:05 | | 1 | CONCATENATION | | | | | | | |* 2 | * HASH JOIN *| | 60183 | 7934K| 4632K| 640K (1)| 02:08:03 | | 3 | TABLE ACCESS FULL| DUMMY2 | 60000 | 3925K| | 157 (1)| 00:00:02 | | 4 | TABLE ACCESS FULL| DUMMY | 100M| 6484M| | 261K (1)| 00:52:13 | |* 5 | *HASH JOIN *| | 3024 | 398K| 4632K| 640K (1)| 02:08:03 | | 6 | TABLE ACCESS FULL| DUMMY2 | 60000 | 3925K| | 157 (1)| 00:00:02 | | 7 | TABLE ACCESS FULL| DUMMY | 100M| 6484M| | 261K (1)| 00:52:13 | -------------------------------------------------------------------------------------- So this looks like a bug! Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 1 April 2016 at 04:53, ashokkumar rajendran < ashokkumar.rajend...@gmail.com> wrote: > Thanks for the reply everyone. > > Let me provide more detail on the dataset as well. > 1. The big table (A) contains more than 3 billion records in parquet > format, which is few TBs. > 2. The second table (B) is only of 60K rows which is less than 10MB. > 3. The column on which I perform JOIN is mostly on the String datatype > columns. > 4. I used 20 machines that were of 16 cores each and 120GB RAMs for > testing this. > > The pseudo OR query is as below. > > sql(Select field1, field2 from A, B where A.dimension1 = B.dimension1 OR > A.dimension2 = B.dimension2 OR A.dimension3 = B.dimension3 OR A.dimension4 = > B.dimension4).explain(true) > > > Pseudo union query is as below. > > sql{Select field1, field2 from A, B where A.dimension1 = B.dimension1 > UNION ALL > Select field1, field2 from A, B where A.dimension2 = B.dimension2 > UNION ALL > Select field1, field2 from A, B where A.dimension3 = B.dimension3 > UNION ALL > Select field1, field2 from A, B where A.dimension4 = > B.dimension4}.explain(true) > > You can look at the explain plan in the ticket > https://issues.apache.org/jira/browse/SPARK-13900 > > Hemant has mentioned the nested loop time will be very little. But the > time taken by inner query for this kind of join is too long. Union of 3 > HashJoins take only 3 minutes (each dimension hashjoin takes 1 minute), > nested loop join takes nearly 13 mins. I agree that the performance of > HashedJoin on OR condition will be linear but that will be very optimized > comparing to the nested loop join. > > As Yong pointed out, if we can provide hints in SQL engine, it will be > awesome for these kind of cases. > > Any idea on how we can optimize this will be helpful. Please let me know > if any other detail is needed to provide input. > > Regards > Ashok > > On Thu, Mar 31, 2016 at 8:32 PM, Mich Talebzadeh < > mich.talebza...@gmail.com> wrote: > >> a hash join come into play when you are joining a large table with >> a small table (large being a relative term)..Hash join by definition works >> on equality condition through hash bucketing. >> >> Can you provide pseudo code for your case please? >> >> HTH >> >> Dr Mich Talebzadeh >> >> >> >> LinkedIn * >> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >> >> >> >> http://talebzadehmich.wordpress.com >> >> >> >> On 31 March 2016 at 09:58, ashokkumar rajendran < >> ashokkumar.rajend...@gmail.com> wrote: >> >>> Hi, >>> >>> I have filed ticket SPARK-13900. There was an initial reply from a >>> developer but did not get any reply on this. How can we do multiple hash >>> joins together for OR conditions based joins? Could someone please guide on >>> how can we fix this? >>> >>> Regards >>> Ashok >>> >> >> >