Hi, I think people have reported the same issue elsewhere, and this should be registered as a bug in Spark:
https://forums.databricks.com/questions/2142/self-join-in-spark-sql.html

Regards,
Gourav

On Thu, Dec 17, 2015 at 10:52 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi Ted,
>
> The self join works fine on tables where the HiveContext tables are
> direct Hive tables, for example:
>
> table1 = hiveContext.sql("select columnA, columnB from hivetable1")
> table1.registerTempTable("table1")
> table1.cache()
> table1.count()
>
> and if I do a self join on table1 things are quite fine.
>
> But in case we have something like this:
>
> table1 = hiveContext.sql("select columnA, columnB from hivetable1")
> table1.registerTempTable("table1")
> table1.cache()
> table1.count()
>
> table2 = hiveContext.sql("select columnA, columnB from hivetable2")
> table2.registerTempTable("table2")
> table2.cache()
> table2.count()
>
> table3 = hiveContext.sql("select table1.* from table1, table2 where
> table1.columnA = table2.columnA")
> table3.registerTempTable("table3")
> table3.cache()
> table3.count()
>
> then the self join on table3 takes data neither from the table3 cache
> nor from the table1 or table2 caches, but starts reading data directly
> from S3 - which, as you would understand, does not make any sense.
>
> Regards,
> Gourav
>
>
> On Wed, Dec 16, 2015 at 7:16 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> I did the following exercise in spark-shell ("c" is a cached table):
>>
>> scala> sqlContext.sql("select x.b from c x join c y on x.a = y.a").explain
>> == Physical Plan ==
>> Project [b#4]
>> +- BroadcastHashJoin [a#3], [a#125], BuildRight
>>    :- InMemoryColumnarTableScan [b#4,a#3], InMemoryRelation
>>       [a#3,b#4,c#5], true, 10000, StorageLevel(true, true, false, true, 1),
>>       ConvertToUnsafe, Some(c)
>>    +- InMemoryColumnarTableScan [a#125], InMemoryRelation
>>       [a#125,b#126,c#127], true, 10000, StorageLevel(true, true, false,
>>       true, 1), ConvertToUnsafe, Some(c)
>>
>> scala> sqlContext.sql("select x.b, y.c from c x join c y on x.a =
>>        y.a").registerTempTable("d")
>> scala> sqlContext.cacheTable("d")
>>
>> scala> sqlContext.sql("select x.b from d x join d y on x.c = y.c").explain
>> == Physical Plan ==
>> Project [b#4]
>> +- SortMergeJoin [c#90], [c#253]
>>    :- Sort [c#90 ASC], false, 0
>>    :  +- TungstenExchange hashpartitioning(c#90,200), None
>>    :     +- InMemoryColumnarTableScan [b#4,c#90], InMemoryRelation
>>            [b#4,c#90], true, 10000, StorageLevel(true, true, false, true, 1),
>>            Project [b#4,c#90], Some(d)
>>    +- Sort [c#253 ASC], false, 0
>>       +- TungstenExchange hashpartitioning(c#253,200), None
>>          +- InMemoryColumnarTableScan [c#253], InMemoryRelation
>>             [b#246,c#253], true, 10000, StorageLevel(true, true, false,
>>             true, 1), Project [b#4,c#90], Some(d)
>>
>> Is the above what you observed?
>>
>> Cheers
>>
>> On Wed, Dec 16, 2015 at 9:34 AM, Gourav Sengupta
>> <gourav.sengu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> This is how the data can be created:
>>>
>>> 1. TableA: cached()
>>> 2. TableB: cached()
>>> 3. TableC: TableA inner join TableB, cached()
>>> 4. TableC join TableC does not take the data from cache but starts
>>>    reading the data for TableA and TableB from disk.
>>>
>>> Does this sound like a bug? The self join between TableA and TableB is
>>> working fine and taking data from cache.
>>>
>>> Regards,
>>> Gourav
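
P.S. For anyone who wants to reproduce this end to end, below is a minimal
PySpark sketch, assuming Spark 1.5/1.6 with an already-created HiveContext
named hiveContext, and two hypothetical Hive source tables (hivetable1 and
hivetable2 are placeholder names). Calling explain() on the final self join
should show whether the plan reads from InMemoryColumnarTableScan nodes
(i.e. the cache) or falls back to scanning the underlying storage:

# Reproduction sketch - assumes an existing HiveContext called
# `hiveContext`; hivetable1/hivetable2 are hypothetical source tables.

table1 = hiveContext.sql("select columnA, columnB from hivetable1")
table1.registerTempTable("table1")
table1.cache()
table1.count()   # count() materialises the cache

table2 = hiveContext.sql("select columnA, columnB from hivetable2")
table2.registerTempTable("table2")
table2.cache()
table2.count()

table3 = hiveContext.sql("select table1.* from table1, table2 "
                         "where table1.columnA = table2.columnA")
table3.registerTempTable("table3")
table3.cache()
table3.count()

# If the cache is honoured, the physical plan should contain
# InMemoryColumnarTableScan nodes; scans of the Hive/S3 source
# instead would confirm the problem described above.
hiveContext.sql("select x.columnA from table3 x join table3 y "
                "on x.columnA = y.columnA").explain()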