Hi,

I think people have reported the same issue elsewhere, and this should
be registered as a bug in Spark:

https://forums.databricks.com/questions/2142/self-join-in-spark-sql.html
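For reference, a minimal reproduction sketch, assembled from the code quoted below in this thread (Spark 1.x API; `hivetable1`, `hivetable2`, and the column names are illustrative, and this needs a live Hive-backed Spark deployment, so it is a sketch rather than a runnable standalone script):

```python
# Reproduction sketch for the cached self-join issue (Spark 1.x API).
# Table/column names are illustrative; requires a Hive-backed cluster.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="self-join-cache-repro")
hiveContext = HiveContext(sc)

table1 = hiveContext.sql("select columnA, columnB from hivetable1")
table1.registerTempTable("table1")
table1.cache()
table1.count()  # action to materialize the cache

table2 = hiveContext.sql("select columnA, columnB from hivetable2")
table2.registerTempTable("table2")
table2.cache()
table2.count()

table3 = hiveContext.sql(
    "select table1.* from table1, table2 "
    "where table1.columnA = table2.columnA")
table3.registerTempTable("table3")
table3.cache()
table3.count()

# Per the report: the physical plan for a self join on table3 should show
# InMemoryColumnarTableScan, but instead scans the source data (S3).
hiveContext.sql(
    "select x.columnA from table3 x join table3 y "
    "on x.columnA = y.columnA").explain()
```

Checking `explain()` output for `InMemoryColumnarTableScan` / `InMemoryRelation` versus a raw source scan is how Ted verifies cache usage later in the thread.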


Regards,
Gourav

On Thu, Dec 17, 2015 at 10:52 AM, Gourav Sengupta <gourav.sengu...@gmail.com
> wrote:

> Hi Ted,
>
> The self join works fine on tables where the HiveContext tables are direct
> Hive tables, for example:
>
> table1 = hiveContext.sql("select columnA, columnB from hivetable1")
> table1.registerTempTable("table1")
> table1.cache()
> table1.count()
>
> and if I do a self join on table1, things are quite fine.
>
> But in case we have something like this:
> table1 = hiveContext.sql("select columnA, columnB from hivetable1")
> table1.registerTempTable("table1")
> table1.cache()
> table1.count()
>
> table2 = hiveContext.sql("select columnA, columnB from hivetable2")
> table2.registerTempTable("table2")
> table2.cache()
> table2.count()
>
> table3 = hiveContext.sql("select table1.* from table1, table2 where
> table1.columnA = table2.columnA")
> table3.registerTempTable("table3")
> table3.cache()
> table3.count()
>
>
> then the self join on table3 does not take data from the table3 cache, nor
> from the table1 or table2 caches, but starts reading data directly from S3,
> which, as you would understand, does not make any sense.
>
>
> Regards,
> Gourav
>
>
>
>
>
> On Wed, Dec 16, 2015 at 7:16 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> I did the following exercise in spark-shell ("c" is cached table):
>>
>> scala> sqlContext.sql("select x.b from c x join c y on x.a = y.a").explain
>> == Physical Plan ==
>> Project [b#4]
>> +- BroadcastHashJoin [a#3], [a#125], BuildRight
>>    :- InMemoryColumnarTableScan [b#4,a#3], InMemoryRelation
>> [a#3,b#4,c#5], true, 10000, StorageLevel(true, true, false, true, 1),
>> ConvertToUnsafe, Some(c)
>>    +- InMemoryColumnarTableScan [a#125], InMemoryRelation
>> [a#125,b#126,c#127], true, 10000, StorageLevel(true, true, false, true, 1),
>> ConvertToUnsafe, Some(c)
>>
>> sqlContext.sql("select x.b, y.c from c x join c y on x.a =
>> y.a").registerTempTable("d")
>> scala> sqlContext.cacheTable("d")
>>
>> scala> sqlContext.sql("select x.b from d x join d y on x.c = y.c").explain
>> == Physical Plan ==
>> Project [b#4]
>> +- SortMergeJoin [c#90], [c#253]
>>    :- Sort [c#90 ASC], false, 0
>>    :  +- TungstenExchange hashpartitioning(c#90,200), None
>>    :     +- InMemoryColumnarTableScan [b#4,c#90], InMemoryRelation
>> [b#4,c#90], true, 10000, StorageLevel(true, true, false, true, 1), Project
>> [b#4,c#90], Some(d)
>>    +- Sort [c#253 ASC], false, 0
>>       +- TungstenExchange hashpartitioning(c#253,200), None
>>          +- InMemoryColumnarTableScan [c#253], InMemoryRelation
>> [b#246,c#253], true, 10000, StorageLevel(true, true, false, true, 1),
>> Project [b#4,c#90], Some(d)
>>
>> Is the above what you observed?
>>
>> Cheers
>>
>> On Wed, Dec 16, 2015 at 9:34 AM, Gourav Sengupta <
>> gourav.sengu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> This is how the data can be created:
>>>
>>> 1. TableA : cached()
>>> 2. TableB : cached()
>>> 3. TableC: TableA inner join TableB cached()
>>> 4. TableC join TableC does not take the data from cache but starts
>>> reading the data for TableA and TableB from disk.
>>>
>>> Does this sound like a bug? The self join between TableA and TableB is
>>> working fine and takes data from the cache.
>>>
>>>
>>> Regards,
>>> Gourav
>>>
>>
>>
>
