I did the following exercise in spark-shell ("c" is a cached table):
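(For context, "c" was set up roughly as follows -- a sketch against the 1.6-era
SQLContext API; the source data and the way the DataFrame is built are my own
assumptions, matching only the column names a, b, c visible in the plans below:)

scala> // hypothetical data: any DataFrame with columns a, b, c will do
scala> val df = sc.parallelize(Seq((1, "x", 1.0), (2, "y", 2.0))).toDF("a", "b", "c")
scala> df.registerTempTable("c")
scala> sqlContext.cacheTable("c")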
scala> sqlContext.sql("select x.b from c x join c y on x.a = y.a").explain
== Physical Plan ==
Project [b#4]
+- BroadcastHashJoin [a#3], [a#125], BuildRight
   :- InMemoryColumnarTableScan [b#4,a#3], InMemoryRelation [a#3,b#4,c#5], true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, Some(c)
   +- InMemoryColumnarTableScan [a#125], InMemoryRelation [a#125,b#126,c#127], true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, Some(c)

scala> sqlContext.sql("select x.b, y.c from c x join c y on x.a = y.a").registerTempTable("d")

scala> sqlContext.cacheTable("d")

scala> sqlContext.sql("select x.b from d x join d y on x.c = y.c").explain
== Physical Plan ==
Project [b#4]
+- SortMergeJoin [c#90], [c#253]
   :- Sort [c#90 ASC], false, 0
   :  +- TungstenExchange hashpartitioning(c#90,200), None
   :     +- InMemoryColumnarTableScan [b#4,c#90], InMemoryRelation [b#4,c#90], true, 10000, StorageLevel(true, true, false, true, 1), Project [b#4,c#90], Some(d)
   +- Sort [c#253 ASC], false, 0
      +- TungstenExchange hashpartitioning(c#253,200), None
         +- InMemoryColumnarTableScan [c#253], InMemoryRelation [b#246,c#253], true, 10000, StorageLevel(true, true, false, true, 1), Project [b#4,c#90], Some(d)

Is the above what you observed?

Cheers

On Wed, Dec 16, 2015 at 9:34 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> Hi,
>
> This is how the data can be created:
>
> 1. TableA : cached()
> 2. TableB : cached()
> 3. TableC: TableA inner join TableB cached()
> 4. TableC join TableC does not take the data from cache but starts reading
> the data for TableA and TableB from disk.
>
> Does this sound like a bug? The self join between TableA and TableB is
> working fine and taking data from cache.
>
>
> Regards,
> Gourav
>