Re: Hive Join with distinct rows

Mark Grover Fri, 09 Nov 2012 08:34:44 -0800

I see. I re-read your first email and you would like to query "select all
the unique ID's in T1 which are not in T2"


Query 1 seems to be doing just fine so I would say that's the way to go. I
personally use "IS" operator when comparing something with NULLs instead of
"=".

There are some optimizations you can read about like semi joins that might
come in handy for this query or queries in the future.

https://cwiki.apache.org/Hive/languagemanual-joins.html

Mark


On Fri, Nov 9, 2012 at 8:00 AM, Praveen Kumar K J V S <
[email protected]> wrote:

> Thanks Mark, I do understand that how Hive works with Distinct keyword.
>
> What I was looking for is a solution for my requirement in Hive, I am not
> an expert in SQL, hence looking for suggestions
>
>
> On Fri, Nov 9, 2012 at 9:54 AM, Mark Grover 
> <[email protected]>wrote:
>
>> Hi Praveen,
>> Let's take an example:
>> (from
>> https://cwiki.apache.org/Hive/languagemanual-select.html#LanguageManualSelect-ALLandDISTINCTClauses
>> )
>>
>> -- Print out contents of the table
>> hive> SELECT col1, col2 FROM t1;
>> 1 3
>> 1 3
>> 1 4
>> 2 5
>>
>> -- Selects distinct col1, col2 tuple
>> hive> SELECT DISTINCT col1, col2 FROM t1;
>> 1 3
>> 1 4
>> 2 5
>>
>>
>> Similar to the second query above, your Query 2 selects each of the
>> distinct values for <id, url, timestamp> tuple possibly giving you multiple
>> records for a given id on the left side of the join. Consequently you don't
>> get the result you expect.
>>
>> Mark
>>
>> On Thu, Nov 8, 2012 at 6:15 PM, Praveen Kumar K J V S <
>> [email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have 2 tables, T1 and T2 of structure columns= {ID, url, timestamp}
>>>
>>> In T1 ID's are repeatable. But T2 ID's are kind of primary key hence
>>> only unique values are present.
>>>
>>> I want to join both tables T1 & T2 such that select all the unique ID's
>>> in T1 which are not in  T2
>>>
>>> I have written 2 queries:
>>>
>>> Query1: select distinct T1.ID from T1 LEFT OUTER JOIN T2 on T1.ID=T2.IDAND
>>> T2.ID=NULL
>>>
>>> this gives me expected results, but with the below I am getting all the
>>> rows in T1
>>>
>>> Query2: select distinct(T1.ID), T1.url, T1.timestamp from T1 LEFT OUTER
>>> JOIN T2 on T1.ID=T2.ID AND T2.ID=NULL
>>>
>>> Can some one point me how to achieve: select all the unique ID's in T1
>>> which are not in T2
>>>
>>> Thanks,
>>> Praveen
>>>
>>>
>>
>

Re: Hive Join with distinct rows

Reply via email to