Hi Ted,

The self join works fine when the HiveContext tables are direct Hive tables. For example:

table1 = hiveContext.sql("select columnA, columnB from hivetable1")
table1.registerTempTable("table1")
table1.cache()
table1.count()

A self join on table1 reads from the cache as expected. But in a case like this:

table1 = hiveContext.sql("select columnA, columnB from hivetable1")
table1.registerTempTable("table1")
table1.cache()
table1.count()

table2 = hiveContext.sql("select columnA, columnB from hivetable2")
table2.registerTempTable("table2")
table2.cache()
table2.count()

table3 = hiveContext.sql("select table1.* from table1, table2 where table1.columnA = table2.columnA")
table3.registerTempTable("table3")
table3.cache()
table3.count()

a self join on table3 does not take data from the table3 cache, nor from the table1 or table2 caches, but starts reading data directly from S3 - which, as you would understand, does not make any sense.

Regards,
Gourav

On Wed, Dec 16, 2015 at 7:16 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> I did the following exercise in spark-shell ("c" is a cached table):
>
> scala> sqlContext.sql("select x.b from c x join c y on x.a = y.a").explain
> == Physical Plan ==
> Project [b#4]
> +- BroadcastHashJoin [a#3], [a#125], BuildRight
>    :- InMemoryColumnarTableScan [b#4,a#3], InMemoryRelation [a#3,b#4,c#5], true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, Some(c)
>    +- InMemoryColumnarTableScan [a#125], InMemoryRelation [a#125,b#126,c#127], true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, Some(c)
>
> scala> sqlContext.sql("select x.b, y.c from c x join c y on x.a = y.a").registerTempTable("d")
> scala> sqlContext.cacheTable("d")
>
> scala> sqlContext.sql("select x.b from d x join d y on x.c = y.c").explain
> == Physical Plan ==
> Project [b#4]
> +- SortMergeJoin [c#90], [c#253]
>    :- Sort [c#90 ASC], false, 0
>    :  +- TungstenExchange hashpartitioning(c#90,200), None
>    :     +- InMemoryColumnarTableScan [b#4,c#90], InMemoryRelation [b#4,c#90], true, 10000, StorageLevel(true, true, false, true, 1), Project [b#4,c#90], Some(d)
>    +- Sort [c#253 ASC], false, 0
>       +- TungstenExchange hashpartitioning(c#253,200), None
>          +- InMemoryColumnarTableScan [c#253], InMemoryRelation [b#246,c#253], true, 10000, StorageLevel(true, true, false, true, 1), Project [b#4,c#90], Some(d)
>
> Is the above what you observed?
>
> Cheers
>
> On Wed, Dec 16, 2015 at 9:34 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
>> Hi,
>>
>> This is how the data can be created:
>>
>> 1. TableA: cached()
>> 2. TableB: cached()
>> 3. TableC: TableA inner join TableB, cached()
>> 4. TableC join TableC does not take the data from cache but starts reading the data for TableA and TableB from disk.
>>
>> Does this sound like a bug? The self join between TableA and TableB is working fine and taking data from cache.
>>
>> Regards,
>> Gourav