I see. I re-read your first email and you would like to query "select all the unique ID's in T1 which are not in T2"
Query 1 seems to be doing just fine so I would say that's the way to go. I personally use "IS" operator when comparing something with NULLs instead of "=". There are some optimizations you can read about like semi joins that might come in handy for this query or queries in the future. https://cwiki.apache.org/Hive/languagemanual-joins.html Mark On Fri, Nov 9, 2012 at 8:00 AM, Praveen Kumar K J V S < [email protected]> wrote: > Thanks Mark, I do understand that how Hive works with Distinct keyword. > > What I was looking for is a solution for my requirement in Hive, I am not > an expert in SQL, hence looking for suggestions > > > On Fri, Nov 9, 2012 at 9:54 AM, Mark Grover > <[email protected]>wrote: > >> Hi Praveen, >> Let's take an example: >> (from >> https://cwiki.apache.org/Hive/languagemanual-select.html#LanguageManualSelect-ALLandDISTINCTClauses >> ) >> >> -- Print out contents of the table >> hive> SELECT col1, col2 FROM t1; >> 1 3 >> 1 3 >> 1 4 >> 2 5 >> >> -- Selects distinct col1, col2 tuple >> hive> SELECT DISTINCT col1, col2 FROM t1; >> 1 3 >> 1 4 >> 2 5 >> >> >> Similar to the second query above, your Query 2 selects each of the >> distinct values for <id, url, timestamp> tuple possibly giving you multiple >> records for a given id on the left side of the join. Consequently you don't >> get the result you expect. >> >> Mark >> >> On Thu, Nov 8, 2012 at 6:15 PM, Praveen Kumar K J V S < >> [email protected]> wrote: >> >>> Hi, >>> >>> I have 2 tables, T1 and T2 of structure columns= {ID, url, timestamp} >>> >>> In T1 ID's are repeatable. But T2 ID's are kind of primary key hence >>> only unique values are present. >>> >>> I want to join both tables T1 & T2 such that select all the unique ID's >>> in T1 which are not in T2 >>> >>> I have written 2 queries: >>> >>> Query1: select distinct T1.ID from T1 LEFT OUTER JOIN T2 on T1.ID=T2.IDAND >>> T2.ID=NULL >>> >>> this gives me expected results, but with the below I am getting all the >>> rows in T1 >>> >>> Query2: select distinct(T1.ID), T1.url, T1.timestamp from T1 LEFT OUTER >>> JOIN T2 on T1.ID=T2.ID AND T2.ID=NULL >>> >>> Can some one point me how to achieve: select all the unique ID's in T1 >>> which are not in T2 >>> >>> Thanks, >>> Praveen >>> >>> >> >
