Hi Mark,
First of all, Happy New Year !
And sorry for the late answer...

I had the hive.optimize.ppd option set to true already so it explains why the 
query works fast :
SELECT count(A.id) FROM a LEFT OUTER JOIN b ON (A.id=B.id AND B.dt=X) WHERE 
A.dt=X AND B.id IS NULL;


I am relatively happy with this query now but nevertheless I feel like this 
situation is not very intuitive and should be documented somewhere.

Cheers,
Michael

-----Message d'origine-----
De : Mark Grover [mailto:mgro...@oanda.com] 
Envoyé : jeudi 22 décembre 2011 19:36
À : user@hive.apache.org
Objet : Re: LEFT OUTER JOIN and partition

Hi Michael,
Thank you for the detailed explanation.

I wouldn't have expected 
SELECT count(A.id) FROM a LEFT OUTER JOIN b ON (A.id=B.id AND A.dt=X AND 
B.dt=X) WHERE B.id IS NULL; 
to give
|A(dt=*)-B(dt=X)| 


Can you try the following?:
set hive.optimize.ppd=true;
SELECT count(A.id) FROM a LEFT OUTER JOIN b ON (A.id=B.id AND B.dt=X) WHERE 
A.dt=X and B.id IS NULL; 

A couple things about the above queries:
1) Depending on what version of Hive you have, you might have Predicate 
Pushdown(ppd) already enabled by default but it doesn't hurt to enable it 
explicitly:-)
2) In left joins, the optimization only happens on the left table (in your 
case, A) and will pull the A.dt=X "up" to filter out the rows before the join 
takes place. 

Let me know what you get.

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation 

www: oanda.com www: fxtrade.com 
e: mgro...@oanda.com 

"Best Trading Platform" - World Finance's Forex Awards 2009. 
"The One to Watch" - Treasury Today's Adam Smith Awards 2009. 


----- Original Message -----
From: "mdefoinplatel ext" <mdefoinplatel....@orange.com>
To: user@hive.apache.org
Sent: Thursday, December 22, 2011 11:37:30 AM
Subject: RE: LEFT OUTER JOIN and partition




I have been looking at this problem in more details… 



So let say you have to tables A and B containing distinct Ids and partitioned 
by date. Let say for each partition, B is a subset of A (all the rows in B are 
present in A) 

If you want to compute the size of set difference A – B for a given partition 
dt=X, it is recommend (on this list) to do : 



SELECT count(A.id) FROM a LEFT OUTER JOIN b ON (A.id=B.id AND A.dt=X AND 
B.dt=X) WHERE B.id IS NULL; 



But actually this doesn’t work as expected (at least by me ;) ) and the query 
above does not give you the expected count |A(dt=X)-B(dt=X)| but this one 
|A(dt=*)-B(dt=X)| 



I made some further tests : 



This query works (it gives the expected count) 

SELECT count(A.id) FROM a LEFT OUTER JOIN b ON (A.id=B.id AND A.dt=B.dt) WHERE 
A.dt=X AND B.id IS NULL; 

but it is not optimal since it performs first a join on all the partitions and 
only in the WHERE close it filters out the partition you are interested in. 



This query works for me and is much faster: 

SELECT count(A.id) FROM a LEFT OUTER JOIN b ON (A.id=B.id AND B.dt=X) WHERE 
A.dt=X AND B.id IS NULL; 



Everything works as if the restriction in the JOIN only apply to the RIGHT 
table… 



I don’t know if this is a bug or if my original understanding of the LEFT OUTER 
JOIN was wrong but I am pretty sure I am not the only one falling into that 
trap ! 



Cheers, 

Michael 









De : mdefoinplatel....@orange.com [mailto:mdefoinplatel....@orange.com] 
Envoyé : jeudi 22 décembre 2011 14:29 
À : hive-u...@hadoop.apache.org 
Objet : LEFT OUTER JOIN and partition 



Hi folks, 

Ok, I give you the context first : 



I have seen in the mailing list people advising the use of a … LEFT OUTER JOIN 
ON … followed by a WHERE close to write a NOT IN type of query. 



So to make it clear, it should be possible to rewrite the following query: 

SELECT a.id FROM a WHERE a.id NOT IN (SELECT b.id FROM b); 



Into the following one: 

SELECT a.id FROM a LEFT OUTER JOIN b ON (a.id=b.id) WHERE b.id IS NULL; 



Ok fine, but what is going on in case of partitioned tables ? 



Suppose a and b are partitioned on dt and you want to write the query only for 
a given partition: 

SELECT a.id FROM a LEFT OUTER JOIN b ON (a.id=b.id AND a.dt=X AND b.dt=X) WHERE 
b.id IS NULL; 



As far as I am concerned, the restriction a.dt=X does not produce the expected 
results and I will get rows with a.dt different from X … 

So of course I can restrict the output to the partitions I am interested in 
with a WHERE clause but this looks very odd to me. 



So my questions are : 

1) Am I the only one experience this behavior ? 

2) Since the JOIN operation is performed before the WHERE one then does it 
really means that the join will take place on all the partitions ( in which 
case this type of query will not be tractable in practice) ? 

3) Is there any other way to write a NOT IN query in Hive and more generally 
how to perform a set difference operation in Hive ? 



Sorry for the long email… 



Cheers, 

Michael 



_________________________________________________________________________________________________________________________
 Ce message et ses pieces jointes peuvent contenir des informations 
confidentielles ou privilegiees et ne doivent donc pas etre diffuses, exploites 
ou copies sans autorisation. Si vous avez recu ce message par erreur, veuillez 
le signaler a l'expediteur et le detruire ainsi que les pieces jointes. Les 
messages electroniques etant susceptibles d'alteration, France Telecom - Orange 
decline toute responsabilite si ce message a ete altere, deforme ou falsifie. 
Merci This message and its attachments may contain confidential or privileged 
information that may be protected by law; they should not be distributed, used 
or copied without authorization. If you have received this email in error, 
please notify the sender and delete this message and its attachments. As emails 
may be altered, France Telecom - Orange shall not be liable if this message was 
modified, changed or falsified. Thank you. 
_________________________________________________________________________________________________________________________

Ce message et ses pieces jointes peuvent contenir des informations 
confidentielles ou privilegiees et ne doivent donc
pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce 
message par erreur, veuillez le signaler
a l'expediteur et le detruire ainsi que les pieces jointes. Les messages 
electroniques etant susceptibles d'alteration,
France Telecom - Orange decline toute responsabilite si ce message a ete 
altere, deforme ou falsifie. Merci

This message and its attachments may contain confidential or privileged 
information that may be protected by law;
they should not be distributed, used or copied without authorization.
If you have received this email in error, please notify the sender and delete 
this message and its attachments.
As emails may be altered, France Telecom - Orange shall not be liable if this 
message was modified, changed or falsified.
Thank you. 

_________________________________________________________________________________________________________________________

Ce message et ses pieces jointes peuvent contenir des informations 
confidentielles ou privilegiees et ne doivent donc
pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce 
message par erreur, veuillez le signaler
a l'expediteur et le detruire ainsi que les pieces jointes. Les messages 
electroniques etant susceptibles d'alteration,
France Telecom - Orange decline toute responsabilite si ce message a ete 
altere, deforme ou falsifie. Merci

This message and its attachments may contain confidential or privileged 
information that may be protected by law;
they should not be distributed, used or copied without authorization.
If you have received this email in error, please notify the sender and delete 
this message and its attachments.
As emails may be altered, France Telecom - Orange shall not be liable if this 
message was modified, changed or falsified.
Thank you.

Reply via email to