I did the following exercise in spark-shell ("c" is a cached table):

scala> sqlContext.sql("select x.b from c x join c y on x.a = y.a").explain
== Physical Plan ==
Project [b#4]
+- BroadcastHashJoin [a#3], [a#125], BuildRight
   :- InMemoryColumnarTableScan [b#4,a#3], InMemoryRelation [a#3,b#4,c#5],
true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe,
Some(c)
   +- InMemoryColumnarTableScan [a#125], InMemoryRelation
[a#125,b#126,c#127], true, 10000, StorageLevel(true, true, false, true, 1),
ConvertToUnsafe, Some(c)

scala> sqlContext.sql("select x.b, y.c from c x join c y on x.a =
y.a").registerTempTable("d")
scala> sqlContext.cacheTable("d")

scala> sqlContext.sql("select x.b from d x join d y on x.c = y.c").explain
== Physical Plan ==
Project [b#4]
+- SortMergeJoin [c#90], [c#253]
   :- Sort [c#90 ASC], false, 0
   :  +- TungstenExchange hashpartitioning(c#90,200), None
   :     +- InMemoryColumnarTableScan [b#4,c#90], InMemoryRelation
[b#4,c#90], true, 10000, StorageLevel(true, true, false, true, 1), Project
[b#4,c#90], Some(d)
   +- Sort [c#253 ASC], false, 0
      +- TungstenExchange hashpartitioning(c#253,200), None
         +- InMemoryColumnarTableScan [c#253], InMemoryRelation
[b#246,c#253], true, 10000, StorageLevel(true, true, false, true, 1),
Project [b#4,c#90], Some(d)

Is the above what you observed?

Cheers

On Wed, Dec 16, 2015 at 9:34 AM, Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:

> Hi,
>
> This is how the data is created:
>
> 1. TableA : cached()
> 2. TableB : cached()
> 3. TableC: TableA inner join TableB cached()
> 4. TableC join TableC does not take the data from cache but starts reading
> the data for TableA and TableB from disk.
>
> Does this sound like a bug? The self join between TableA and TableB is
> working fine and takes its data from cache.
>
>
> Regards,
> Gourav
>
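A rough spark-shell sketch of the steps Gourav describes (column names k, v,
w are illustrative, assuming TableA and TableB are already registered):

scala> sqlContext.cacheTable("TableA")
scala> sqlContext.cacheTable("TableB")
scala> sqlContext.sql("select a.k, a.v, b.w from TableA a join TableB b on a.k = b.k").registerTempTable("TableC")
scala> sqlContext.cacheTable("TableC")
scala> sqlContext.sql("select x.v from TableC x join TableC y on x.w = y.w").explain

If the self join is served from cache, both sides of the physical plan should
show InMemoryColumnarTableScan over TableC's InMemoryRelation; a plan that
instead rescans TableA and TableB would match the reported behavior.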
