Hi Ted,

The self join works fine on tables where the HiveContext tables are direct
Hive tables. For instance:

table1 = hiveContext.sql("select columnA, columnB from hivetable1")
table1.registerTempTable("table1")
table1.cache()
table1.count()

and if I do a self join on table1, things work quite fine.

But when we have something like this:
table1 = hiveContext.sql("select columnA, columnB from hivetable1")
table1.registerTempTable("table1")
table1.cache()
table1.count()

table2 = hiveContext.sql("select columnA, columnB from hivetable2")
table2.registerTempTable("table2")
table2.cache()
table2.count()

table3 = hiveContext.sql("select table1.* from table1, table2 where
table1.columnA = table2.columnA")
table3.registerTempTable("table3")
table3.cache()
table3.count()


then the self join on table3 does not take data from the table3 cache, nor
from the table1 or table2 caches, but instead starts reading data directly
from S3 - which, as you would understand, does not make any sense.
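For what it's worth, one quick way to confirm which path is taken is to call explain() on the self join and look at the plan text, as in Ted's example below. Here is a rough sketch of how I check the printed plan; the helper names are my own, but InMemoryColumnarTableScan and HiveTableScan are the operator names Spark 1.x prints in physical plans (exact source-scan operator names can vary by version and data source, so treat the marker list as an assumption):

```python
# Hypothetical helpers (not a Spark API) to classify an explain() plan dump.

def plan_uses_cache(plan_text):
    """True if the physical plan reads from the in-memory cache."""
    return "InMemoryColumnarTableScan" in plan_text

def plan_reads_source(plan_text):
    """True if the plan falls back to scanning the source (e.g. Hive/S3)."""
    return any(marker in plan_text
               for marker in ("HiveTableScan", "HadoopRDD", "ParquetRelation"))

# A plan like the first one Ted posted, where both sides hit the cache:
cached_plan = """
== Physical Plan ==
Project [b#4]
+- BroadcastHashJoin [a#3], [a#125], BuildRight
   :- InMemoryColumnarTableScan [b#4,a#3], InMemoryRelation [a#3,b#4,c#5], ...
   +- InMemoryColumnarTableScan [a#125], InMemoryRelation [a#125,b#126,c#127], ...
"""

print(plan_uses_cache(cached_plan))    # True
print(plan_reads_source(cached_plan))  # False
```

In my case the plan for the table3 self join shows source scans rather than InMemoryColumnarTableScan, which is what makes me think the cache is being bypassed.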


Regards,
Gourav

On Wed, Dec 16, 2015 at 7:16 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> I did the following exercise in spark-shell ("c" is cached table):
>
> scala> sqlContext.sql("select x.b from c x join c y on x.a = y.a").explain
> == Physical Plan ==
> Project [b#4]
> +- BroadcastHashJoin [a#3], [a#125], BuildRight
>    :- InMemoryColumnarTableScan [b#4,a#3], InMemoryRelation [a#3,b#4,c#5],
> true, 10000, StorageLevel(true, true, false, true, 1), ConvertToUnsafe,
> Some(c)
>    +- InMemoryColumnarTableScan [a#125], InMemoryRelation
> [a#125,b#126,c#127], true, 10000, StorageLevel(true, true, false, true, 1),
> ConvertToUnsafe, Some(c)
>
> sqlContext.sql("select x.b, y.c from c x join c y on x.a =
> y.a").registerTempTable("d")
> scala> sqlContext.cacheTable("d")
>
> scala> sqlContext.sql("select x.b from d x join d y on x.c = y.c").explain
> == Physical Plan ==
> Project [b#4]
> +- SortMergeJoin [c#90], [c#253]
>    :- Sort [c#90 ASC], false, 0
>    :  +- TungstenExchange hashpartitioning(c#90,200), None
>    :     +- InMemoryColumnarTableScan [b#4,c#90], InMemoryRelation
> [b#4,c#90], true, 10000, StorageLevel(true, true, false, true, 1), Project
> [b#4,c#90], Some(d)
>    +- Sort [c#253 ASC], false, 0
>       +- TungstenExchange hashpartitioning(c#253,200), None
>          +- InMemoryColumnarTableScan [c#253], InMemoryRelation
> [b#246,c#253], true, 10000, StorageLevel(true, true, false, true, 1),
> Project [b#4,c#90], Some(d)
>
> Is the above what you observed ?
>
> Cheers
>
> On Wed, Dec 16, 2015 at 9:34 AM, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
>> Hi,
>>
>> This is how the data can be created:
>>
>> 1. TableA : cached()
>> 2. TableB : cached()
>> 3. TableC: TableA inner join TableB cached()
>> 4. TableC join TableC does not take the data from cache but starts
>> reading the data for TableA and TableB from disk.
>>
>> Does this sound like a bug? The self join between TableA and TableB is
>> working fine and taking data from cache.
>>
>>
>> Regards,
>> Gourav
>>
>
>
